survey

Density-based Algorithms for Big Data Clustering Using MapReduce Framework: A Comprehensive Study

Published: 28 September 2020

Abstract

Clustering extracts hidden patterns and groups of similar objects from data. As a form of unsupervised learning, it is a crucial technique for big data analysis because of the massive number of unlabeled objects involved. Density-based algorithms have attracted research interest because they help reveal complex patterns in spatial datasets that describe co-located objects. Big data clustering is challenging because the volume of data grows exponentially, and clustering with MapReduce can help meet this challenge. Accordingly, density-based algorithms in MapReduce have been investigated extensively over the past decade to address the problem of big data clustering. Despite the diversity of the proposed algorithms, the field lacks a structured review of the available algorithms and of the techniques they use for partitioning, local clustering, and merging. This study formalizes the problem of density-based clustering using MapReduce, proposes a taxonomy to categorize the proposed algorithms, and provides a systematic and comprehensive comparison of these algorithms according to their partitioning technique, type of local clustering, merging technique, and the exactness of their implementations. Finally, the study highlights outstanding challenges and opportunities for contributing to the field of density-based clustering using MapReduce.
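The three stages named in the abstract (partitioning, local clustering, and merging) can be illustrated with a small single-machine sketch. This is not any specific algorithm from the survey: it assumes 2-D points, fixed-width vertical strips with an eps-wide overlap as the partitioning scheme, plain DBSCAN as the local clusterer, and a union-find merge over points shared between strips. All function names and parameters (`partition`, `local_dbscan`, `mr_dbscan`, `width`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from math import dist

NOISE = -1

def partition(points, eps, width):
    """Map-like step: send each point to its home strip and, when it lies
    within eps of a strip boundary, also to the neighbouring strip's halo."""
    strips = defaultdict(list)
    for i, (x, _) in enumerate(points):
        home = int(x // width)
        strips[home].append(i)
        if x - home * width <= eps:
            strips[home - 1].append(i)
        if (home + 1) * width - x <= eps:
            strips[home + 1].append(i)
    return strips

def local_dbscan(ids, points, eps, min_pts):
    """Plain DBSCAN restricted to `ids`; returns ({id: label}, core ids)."""
    neighbors = {i: [j for j in ids if dist(points[i], points[j]) <= eps]
                 for i in ids}
    core = {i for i in ids if len(neighbors[i]) >= min_pts}
    labels = {i: NOISE for i in ids}
    cluster = 0
    for i in ids:
        if i not in core or labels[i] != NOISE:
            continue
        labels[i] = cluster
        stack = [i]                      # only core points are expanded
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if q in core:
                        stack.append(q)
        cluster += 1
    return labels, core

def mr_dbscan(points, eps, min_pts, width):
    """Partition, cluster each strip independently, then merge local clusters."""
    assert width >= eps, "sketch assumes a halo never spans more than one strip"
    strips = partition(points, eps, width)
    local = {k: local_dbscan(ids, points, eps, min_pts)
             for k, ids in strips.items()}

    parent = {}
    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    # Collect every (strip, local label) a point received, with its core flag.
    seen = defaultdict(list)
    for k, (labels, core) in local.items():
        for i, lab in labels.items():
            if lab != NOISE:
                seen[i].append(((k, lab), i in core))

    # Merge rule (an assumption of this sketch): local clusters that share a
    # point which is core in at least one strip belong to the same cluster.
    for entries in seen.values():
        if any(is_core for _, is_core in entries):
            first = entries[0][0]
            for c, _ in entries[1:]:
                parent[find(c)] = find(first)

    # Relabel merged clusters with consecutive integers.
    roots, result = {}, []
    for i in range(len(points)):
        if i in seen:
            root = find(seen[i][0][0])
            result.append(roots.setdefault(root, len(roots)))
        else:
            result.append(NOISE)
    return result
```

For example, calling `mr_dbscan(points, eps=0.6, min_pts=3, width=2.0)` on a chain of points that crosses several strip boundaries should return a single label for the whole chain, because the overlap lets each boundary point be seen as a core point in its home strip. Border points that fall within eps of two different clusters are assigned to whichever local cluster is encountered first, which mirrors the order dependence of sequential DBSCAN.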




    Information

    Published In

    ACM Computing Surveys  Volume 53, Issue 5
    September 2021
    782 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/3426973

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 September 2020
    Accepted: 01 May 2020
    Revised: 01 February 2020
    Received: 01 July 2019
    Published in CSUR Volume 53, Issue 5


    Author Tags

    1. Big data
    2. clustering
    3. density clustering
    4. MapReduce framework

    Qualifiers

    • Survey
    • Research
    • Refereed
