Abstract
Processing huge volumes of data has opened new opportunities in e-commerce, engineering, business, and large-scale computing applications. The MapReduce programming model is a parallel data processing approach for execution on computer clusters. It provides an abstraction for designing scalable algorithms for big data processing, and it speeds up computation for batch-oriented workloads. However, the key/value pair generation of a MapReduce program creates memory overhead and deserialization overhead due to data redundancy. Data redundancy is one of the most important factors that consume space and degrade system performance when working with large data sets. This overhead can be reduced considerably by a novel approach that we developed, the Data Triggered Multithreaded Programming (DTMP) model. In this paper, we demonstrate the DTMP model on a large dataset of author details and their publications. DTMP can allocate resources dynamically and can identify data repetition occurring during computation. When applied to the MapReduce programming model, DTMP improves system performance. The major contribution of this work is a simple, scalable, and powerful method for processing text data that enables automatic parallelization and distribution of large-scale computations.
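The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' DTMP implementation: the function and class names (`map_phase`, `reduce_phase`, `DataTriggeredCounter`), the record layout, and the sample data are all hypothetical, and the sketch runs in a single thread. It only shows the general idea that a plain map/reduce pass materializes every key/value pair, including repeated ones, while a data-triggered variant recomputes a result only when the underlying data actually changes.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (author, title) pair per input record, duplicates included."""
    for author, title in records:
        yield author, title


def reduce_phase(pairs):
    """Group titles by author and count the distinct titles per author."""
    grouped = defaultdict(list)
    for author, title in pairs:
        grouped[author].append(title)  # repeated pairs are stored again here
    return {author: len(set(titles)) for author, titles in grouped.items()}


class DataTriggeredCounter:
    """Hypothetical sketch: recompute a count only when new data arrives."""

    def __init__(self):
        self.titles = defaultdict(set)
        self.counts = {}
        self.recomputations = 0

    def update(self, author, title):
        if title in self.titles[author]:
            return  # repeated data: nothing changed, so no work is triggered
        self.titles[author].add(title)
        self.counts[author] = len(self.titles[author])
        self.recomputations += 1


if __name__ == "__main__":
    # Made-up sample records; the second one repeats the first.
    records = [
        ("Author A", "Paper 1"),
        ("Author A", "Paper 1"),
        ("Author B", "Paper 2"),
        ("Author A", "Paper 3"),
    ]
    print(reduce_phase(map_phase(records)))

    counter = DataTriggeredCounter()
    for author, title in records:
        counter.update(author, title)
    print(counter.counts, counter.recomputations)
```

Both paths produce the same per-author counts. The difference is that the map/reduce path stores the repeated pair before discarding it, whereas the data-triggered path skips it on arrival.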
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Sandhya, N., Samuel, P. (2016). Data Centric Text Processing Using MapReduce. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_11
DOI: https://doi.org/10.1007/978-3-319-28031-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28030-1
Online ISBN: 978-3-319-28031-8
eBook Packages: Engineering (R0)