Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce

Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics

Abstract

Big data analytics platforms are designed to improve performance by avoiding the extract, transform, load (ETL) approach. Several techniques that have worked very well for performance optimization in relational databases are still being integrated into big data analytics; indexing, together with its underlying data structures, is one of them. Despite its popularity in query optimization for efficient data mining, indexing was not originally integrated into the MapReduce platform: by design, MapReduce performs a full scan of the input data. In recent years there have been attempts to incorporate indexing into MapReduce to improve performance, but these attempts have not exhausted the potential of indexing in MapReduce query processing. Consequently, this chapter presents an indexing approach that uses the partitioned B+-Tree as its data structure to index the InputSplit component of the Hadoop distributed file system, with the aim of achieving efficient data mining query processing with Hadoop MapReduce. The results of this study show that the proposed index method reduced both the index size and the execution runtime of all search queries by at least 50% for all data sizes used, compared with normal MapReduce processing and with another clustered index approach. The proposed index approach therefore has the potential to halve the time taken to mine data within a dataset.
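To make the idea of a partitioned index concrete, the following is a minimal illustrative sketch and not the chapter's implementation. It assumes the general partitioned B-tree idea of prefixing every key with a partition identifier so that all partitions live in one sorted structure; here the partition identifier stands in for an input split. A sorted Python list searched with `bisect` is used as a stand-in for a real B+-Tree, and the names `build_partitioned_index`, `lookup`, and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_partitioned_index(splits):
    """Build one sorted index over all splits.

    `splits` maps a split id to a list of (key, offset) records.
    Each entry is stored as (split_id, key, offset), so the split id
    acts as an artificial leading key column (assumption: this mirrors
    the partitioned B-tree idea; a sorted list replaces the tree).
    """
    entries = [
        (split_id, key, offset)
        for split_id, records in splits.items()
        for key, offset in records
    ]
    entries.sort()
    return entries


def lookup(index, split_ids, key):
    """Return (split_id, offset) pairs for `key` in the given splits.

    Each split is probed with a binary search instead of a full scan.
    """
    hits = []
    for split_id in split_ids:
        # (split_id, key) sorts before any (split_id, key, offset) entry.
        lo = bisect_left(index, (split_id, key))
        # float("inf") sorts after every integer offset for this key.
        hi = bisect_right(index, (split_id, key, float("inf")))
        hits.extend((s, off) for s, _, off in index[lo:hi])
    return hits


# Hypothetical sample data: two splits with byte offsets per key.
sample_splits = {
    0: [("apple", 0), ("pear", 40)],
    1: [("apple", 12), ("kiwi", 30)],
}
sample_index = build_partitioned_index(sample_splits)
```

With this layout, a search such as `lookup(sample_index, [0, 1], "apple")` touches only the index entries for the requested key in each split, which is the kind of work a full scan of the input cannot avoid.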



Corresponding author

Correspondence to Ali Usman Abdullahi.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Abdullahi, A.U., Ahmad, R., Zakaria, N.M. (2021). Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce. In: Chiroma, H., Abdulhamid, S.M., Fournier-Viger, P., Garcia, N.M. (eds) Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics. Springer, Cham. https://doi.org/10.1007/978-3-030-66288-2_9

  • DOI: https://doi.org/10.1007/978-3-030-66288-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66287-5

  • Online ISBN: 978-3-030-66288-2

  • eBook Packages: Computer Science, Computer Science (R0)
