Abstract
The explosive growth in the volume, velocity, and diversity of data produced by mobile devices and cloud applications has led to an abundance of data, or ‘big data.’ Available solutions for efficient data storage and management cannot meet the needs of such heterogeneous data, whose volume is continuously increasing. Existing indexing solutions become inefficient as index size and seek time grow rapidly, so efficient retrieval and management of big data require an optimized index scheme. In real-world applications, the problem of indexing big data in cloud computing is widespread in healthcare, enterprises, scientific experiments, and social networks. To date, diverse soft computing, machine learning, and other artificial intelligence techniques have been used to satisfy the indexing requirements, yet the literature reports no state-of-the-art survey investigating the performance and consequences of techniques for solving big data indexing issues as they enter cloud computing. The objective of this paper is to investigate and examine the existing indexing techniques for big data. A taxonomy of indexing techniques is developed to help researchers understand and select a technique as a basis for designing an indexing mechanism with reduced time and space consumption for BD-MCC. In this study, 48 indexing techniques have been studied and compared based on 60 articles related to the topic. The performance of the indexing techniques is analyzed based on their characteristics and big data indexing requirements. The main contribution of this study is a taxonomy that categorizes indexing techniques by method: non-artificial intelligence, artificial intelligence, and collaborative artificial intelligence indexing methods. In addition, the significance of different procedures and their performance is analyzed, along with the limitations of each technique.
In conclusion, the paper elaborates on several key future research topics with the potential to accelerate the progress and deployment of artificial intelligence-based cooperative indexing in BD-MCC.
Acknowledgments
The authors would like to thank the University of Malaya for grant “Big Data and Mobile Cloud For Collaborative Experiments”, Project Number: P012C-13AFR and Malaysian Ministry of Higher Education under the University of Malaya High Impact Research Grant “Mobile Cloud Computing: Device and Connectivity”, Project Number: M.C/625/1/HIR/MOE/FCSIT/03.
Cite this article
Gani, A., Siddiqa, A., Shamshirband, S. et al. A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46, 241–284 (2016). https://doi.org/10.1007/s10115-015-0830-y
DOI: https://doi.org/10.1007/s10115-015-0830-y