Abstract
Biclustering is a data mining technique that finds groups of rows and columns that are highly correlated in a 2D dataset. Although several software applications perform biclustering, most of them suffer from a high computational complexity that prevents their use on large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters in binary data, which is quite common in many research fields such as text mining, marketing or bioinformatics. ScalaParBiBit takes advantage of the special characteristics of these binary datasets, as well as of an efficient parallel implementation and algorithm, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation shows that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit in a cluster with 32 nodes and 768 cores. Our tool, together with its reference manual, is freely available at https://github.com/fraguela/ScalaParBiBit.
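To illustrate why binary data lends itself to fast biclustering, the following minimal sketch packs each row into an integer and derives candidate column patterns with a bitwise AND of row pairs, in the spirit of BiBit-style tools. It is an assumption-laden illustration, not the ScalaParBiBit implementation: the function names, the thresholds `min_rows` and `min_cols`, and the sequential pairwise loop are all hypothetical.

```python
from itertools import combinations


def pack_row(bits):
    """Pack a list of 0/1 values into one int (bit i represents column i)."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def find_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (row_indices, column_indices) pairs for an all-ones submatrix search.

    Hypothetical sketch: each pair of rows yields a pattern (their AND);
    every row that contains the whole pattern joins the bicluster.
    """
    rows = [pack_row(r) for r in matrix]
    n_cols = len(matrix[0]) if matrix else 0
    seen = set()      # patterns already processed, to avoid duplicates
    result = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it has a 1 in every pattern column.
        members = [k for k, r in enumerate(rows) if (r & pattern) == pattern]
        if len(members) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> c) & 1]
            result.append((members, cols))
    return result


if __name__ == "__main__":
    data = [
        [1, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ]
    for members, cols in find_biclusters(data):
        print("rows", members, "columns", cols)
```

Because one AND compares many columns at once, the cost of each row comparison depends on the number of machine words rather than on the number of columns, which is the kind of property a tool for binary data can exploit.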
Data availability
The application developed in this manuscript, together with building and usage instructions, as well as the datasets used in the experiments are publicly available under an open source license at https://github.com/fraguela/ScalaParBiBit.
Acknowledgements
This research was supported by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). We also acknowledge the support from the Centro Singular de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014-2020 Program), by grant ED431G 2019/01. We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the usage of their resources.
About this article
Cite this article
Fraguela, B.B., Andrade, D. & González-Domínguez, J. ScalaParBiBit: scaling the binary biclustering in distributed-memory systems. Cluster Comput 24, 2249–2268 (2021). https://doi.org/10.1007/s10586-021-03261-z
DOI: https://doi.org/10.1007/s10586-021-03261-z