Abstract
Big data analytics platforms are designed to improve performance by avoiding the extract, transform, load (ETL) approach. Several techniques that have worked very well for performance optimization in relational databases are still in the process of being integrated into big data analytics. Indexing, together with its data structures, is one such technique. Despite its popularity in query optimization for efficient data mining, indexing was not integrated into the MapReduce platform, which by design performs a full scan of the input data. In recent years there have been attempts to incorporate indexing into MapReduce for performance improvement, but these attempts have not exhausted the potential of indexing in MapReduce query processing. Consequently, this chapter presents an indexing approach that uses the partitioned B+-Tree as its data structure to index the InputSplit component of the Hadoop Distributed File System (HDFS), with the aim of achieving efficient data mining query processing with Hadoop MapReduce. The results of this study showed that the proposed index method significantly reduced the index size as well as the execution runtime of all search queries by at least 50% for all the data sizes used, when compared with normal MapReduce processing and another clustered index approach. Thus, the proposed index approach has the potential to halve the time taken to mine data within a dataset.
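The contrast between a full scan and a per-split index lookup can be illustrated with a minimal sketch. This is not the chapter's implementation: the function names (`build_split_index`, `full_scan`, `index_lookup`) are hypothetical, an in-memory list of `(key, value)` pairs stands in for the records of an input file, fixed-size slices stand in for InputSplits, and a sorted list searched with `bisect` stands in for the leaf level of one B+-Tree partition.

```python
from bisect import bisect_left


def build_split_index(records, split_size):
    """Cut records into fixed-size splits and build one sorted index per split.

    Each index is a sorted list of (key, offset) pairs. This is an assumed
    stand-in for the leaf level of one B+-Tree partition, not a real B+-Tree.
    """
    index = []
    for start in range(0, len(records), split_size):
        split = records[start:start + split_size]
        entries = sorted((key, start + i) for i, (key, _) in enumerate(split))
        index.append(entries)
    return index


def full_scan(records, key):
    """Baseline: inspect every record, as a job without an index would."""
    return [value for k, value in records if k == key]


def index_lookup(records, index, key):
    """Probe each partition by binary search and read only matching offsets."""
    hits = []
    for entries in index:
        # Skip a partition whose key range cannot contain the search key.
        if not entries or key < entries[0][0] or key > entries[-1][0]:
            continue
        # (key,) sorts before every (key, offset), so this finds the first match.
        pos = bisect_left(entries, (key,))
        while pos < len(entries) and entries[pos][0] == key:
            hits.append(records[entries[pos][1]][1])
            pos += 1
    return hits


# Hypothetical data: 20 records whose keys cycle through 0..6.
records = [(k % 7, f"v{k}") for k in range(20)]
index = build_split_index(records, split_size=5)
print(index_lookup(records, index, 3))
```

Both functions return the same records; the difference is that the lookup touches only the index entries near the search key in each partition instead of every record, which is the kind of saving the abstract attributes to indexing.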
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Abdullahi, A.U., Ahmad, R., Zakaria, N.M. (2021). Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce. In: Chiroma, H., Abdulhamid, S.M., Fournier-Viger, P., Garcia, N.M. (eds) Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics. Springer, Cham. https://doi.org/10.1007/978-3-030-66288-2_9
DOI: https://doi.org/10.1007/978-3-030-66288-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66287-5
Online ISBN: 978-3-030-66288-2
eBook Packages: Computer Science, Computer Science (R0)