survey

Density-based Algorithms for Big Data Clustering Using MapReduce Framework: A Comprehensive Study

Authors:

Ghazi Al-Naymat

ACM Computing Surveys (CSUR), Volume 53, Issue 5

Article No.: 93, Pages 1 - 38

https://doi.org/10.1145/3403951

Published: 28 September 2020

Abstract

Clustering extracts hidden patterns and groups of similar objects from data. As a form of unsupervised learning, it is therefore a crucial technique for big data analysis, where massive numbers of unlabeled objects are involved. Density-based algorithms have attracted research interest because they help to reveal complex patterns in spatial datasets that describe co-located objects. Big data clustering is challenging because the volume of data increases exponentially; clustering with MapReduce can help to meet this challenge. Accordingly, density-based algorithms in MapReduce have been widely investigated over the past decade to address the problem of big data clustering. Despite the diversity of the proposed algorithms, the field lacks a structured review of the available algorithms and of the techniques for partitioning, local clustering, and merging. This study formalizes the problem of density-based clustering using MapReduce, proposes a taxonomy to categorize the proposed algorithms, and provides a systematic and comprehensive comparison of these algorithms according to the partitioning technique, type of local clustering, merging technique, and exactness of their implementations. Finally, the study highlights outstanding challenges and opportunities for contributing to the field of density-based clustering using MapReduce.
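
The three phases named in the abstract (partitioning, local clustering, merging) can be illustrated with a small single-process sketch. This is not the method of any particular surveyed algorithm: it assumes a DBSCAN-style local clusterer, vertical strip partitions with an eps-wide halo, and a merge rule that joins local clusters sharing a point that is core in its home strip. All function names, the `width` parameter, and the strip scheme are hypothetical choices made for illustration, and the loop over strips stands in for what would be independent reducers in a real MapReduce job.

```python
from collections import defaultdict

def local_dbscan(points, eps, min_pts):
    """Plain brute-force DBSCAN on one partition.
    Returns (labels, core_flags); label -1 means noise."""
    n = len(points)
    neighbors = [[j for j, (x, y) in enumerate(points)
                  if (x - px) ** 2 + (y - py) ** 2 <= eps * eps]
                 for (px, py) in points]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels, cluster = [-1] * n, 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i], stack = cluster, [i]
        while stack:                       # only core points are expanded
            for q in neighbors[stack.pop()]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels, core

def partition(points, width, eps):
    """'Map' step (assumed scheme): strips of the given width along x.
    Each point goes to its home strip and, if within eps of a strip
    border, also to the neighbouring strip as a halo copy.
    Assumes width >= eps."""
    parts = defaultdict(list)              # strip -> [(point id, is_owner)]
    for pid, (x, _) in enumerate(points):
        home = int(x // width)
        parts[home].append((pid, True))
        if x - home * width <= eps:
            parts[home - 1].append((pid, False))
        if (home + 1) * width - x <= eps:
            parts[home + 1].append((pid, False))
    return parts

def find(parent, a):
    """Union-find lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def mr_dbscan_sketch(points, eps, min_pts, width):
    parts = partition(points, width, eps)
    seen = defaultdict(list)               # pid -> [(cluster key, is_owner, is_core)]
    parent = {}
    for strip, members in parts.items():   # 'reduce' step: one local run per strip
        ids = [pid for pid, _ in members]
        labels, core = local_dbscan([points[p] for p in ids], eps, min_pts)
        for (pid, owner), lab, is_core in zip(members, labels, core):
            if lab != -1:
                key = (strip, lab)
                parent.setdefault(key, key)
                seen[pid].append((key, owner, is_core))
    # Merge step: a point that is core in its home strip links every
    # local cluster that contains one of its copies.
    for entries in seen.values():
        if any(owner and is_core for _, owner, is_core in entries):
            root = find(parent, entries[0][0])
            for key, _, _ in entries[1:]:
                parent[find(parent, key)] = root
    # Final labels: prefer the home strip's view, fall back to a halo copy.
    result, names = [], {}
    for pid in range(len(points)):
        entries = sorted(seen.get(pid, []), key=lambda e: not e[1])
        if not entries:
            result.append(-1)
        else:
            root = find(parent, entries[0][0])
            result.append(names.setdefault(root, len(names)))
    return result
```

Because every point within eps of a home point is copied into that strip, core status is decided correctly in the home strip, which is why the merge rule trusts only the owner's view. The halo width, the partition shape, and the merge criterion are exactly the design dimensions along which the study compares algorithms, so other choices are equally plausible.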

Index Terms

Density-based Algorithms for Big Data Clustering Using MapReduce Framework: A Comprehensive Study
1. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. MapReduce algorithms

Published In

ACM Computing Surveys Volume 53, Issue 5

September 2021

782 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3426973

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 September 2020

Accepted: 01 May 2020

Revised: 01 February 2020

Received: 01 July 2019

Published in CSUR Volume 53, Issue 5

Qualifiers

Survey
Research
Refereed
