Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce

Abstract

Big data analytics platforms are designed to improve performance by avoiding the extract, transform, load (ETL) approach. Several techniques that have worked very well for performance optimization in relational databases are still being integrated into big data analytics; indexing, together with its underlying data structures, is one example. Despite its popularity in query optimization for efficient data mining, indexing was not originally integrated into the MapReduce platform: by design, MapReduce performs a full scan of the input data. In recent years there have been attempts to incorporate indexing into MapReduce to improve performance, but these attempts have not exhausted the potential of indexing in MapReduce query processing. This chapter therefore presents an indexing approach that uses the partitioned B+-Tree as its data structure to index the InputSplit component of the Hadoop distributed file system, with the aim of achieving efficient data mining query processing with Hadoop MapReduce. The results of this study showed that, compared with normal MapReduce processing and with another clustered index approach, the proposed index method significantly reduced both the index size and the execution runtime of all search queries, by at least 50% for all data sizes used. Thus, the proposed index approach has the potential to cut the time taken to mine data within a dataset by half.
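
The idea of a partitioned index over input splits can be illustrated with a minimal sketch. This is not the chapter's implementation: it assumes that each split is treated as one partition, that the partition identifier acts as an artificial leading key field (the general idea behind partitioned B-trees), and that a sorted in-memory list can stand in for the leaf level of a B+-Tree. The names `PartitionedIndex` and `build_index`, and the comma-separated record layout, are hypothetical.

```python
from bisect import bisect_left, insort


class PartitionedIndex:
    """Sorted (partition_id, key, offset) entries; a stand-in for B+-Tree leaves."""

    def __init__(self):
        self._entries = []        # kept sorted by (partition_id, key, offset)
        self._partitions = set()  # partition ids seen so far

    def insert(self, partition_id, key, offset):
        # The partition id is the leading key field, so each partition
        # occupies one contiguous run of the sorted entries.
        insort(self._entries, (partition_id, key, offset))
        self._partitions.add(partition_id)

    def lookup(self, key):
        """Return (partition_id, offset) pairs for every record with this key."""
        hits = []
        for pid in sorted(self._partitions):
            # Binary search inside this partition's run, not a full scan.
            i = bisect_left(self._entries, (pid, key))
            while i < len(self._entries) and self._entries[i][:2] == (pid, key):
                hits.append((pid, self._entries[i][2]))
                i += 1
        return hits


def build_index(splits):
    """Index the first comma-separated field of each record in each split.

    `splits` is a list of lists of text lines (an assumed, simplified
    stand-in for input splits); offsets are byte positions within a split.
    """
    index = PartitionedIndex()
    for split_id, records in enumerate(splits):
        offset = 0
        for line in records:
            key = line.split(",", 1)[0]
            index.insert(split_id, key, offset)
            offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    return index


splits = [["a,1", "b,2", "a,3"], ["c,4", "a,5"]]
index = build_index(splits)
print(index.lookup("a"))  # [(0, 0), (0, 8), (1, 4)]
print(index.lookup("z"))  # []
```

With such an index, a search query would read only the records at the returned offsets in each split instead of scanning every record, which is the kind of saving over a full scan that the abstract describes.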

Author information

Authors and Affiliations

Computer Science Education Department, Federal College of Education (Tech), Gombe, Nigeria
Ali Usman Abdullahi
Computer and Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar, Perak, Malaysia
Rohiza Ahmad & Nordin M. Zakaria

Authors

Ali Usman Abdullahi
Rohiza Ahmad
Nordin M. Zakaria

Corresponding author

Correspondence to Ali Usman Abdullahi.

Editor information

Editors and Affiliations

Mathematical Sciences, Abubakar Tafawa Balewa University, Bauchi, Nigeria
Haruna Chiroma
Information Technology & Cyber Security, Community College Qatar, Doha, Qatar
Shafi’i M. Abdulhamid
School of Computer Science, Harbin Institute of Technology, Shenzhen, China
Philippe Fournier-Viger
Instituto de Telecomunicações, University of Beira Interior, Covilha, Portugal
Nuno M. Garcia

About this chapter

Cite this chapter

Abdullahi, A.U., Ahmad, R., Zakaria, N.M. (2021). Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce. In: Chiroma, H., Abdulhamid, S.M., Fournier-Viger, P., Garcia, N.M. (eds) Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics. Springer, Cham. https://doi.org/10.1007/978-3-030-66288-2_9

DOI: https://doi.org/10.1007/978-3-030-66288-2_9
Published: 02 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66287-5
Online ISBN: 978-3-030-66288-2
eBook Packages: Computer Science, Computer Science (R0)
