ScalaParBiBit: scaling the binary biclustering in distributed-memory systems

Basilio B. Fraguela¹ (ORCID: orcid.org/0000-0002-3438-5960),
Diego Andrade¹ &
Jorge González-Domínguez¹

Abstract

Biclustering is a data mining technique that finds groups of rows and columns that are highly correlated in a 2D dataset. Although several software applications perform biclustering, most of them suffer from a high computational complexity that prevents their use on large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters in binary data, which are common in research fields such as text mining, marketing and bioinformatics. ScalaParBiBit takes advantage of the special characteristics of these binary datasets, as well as of an efficient parallel algorithm and implementation, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation shows that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit in a cluster with 32 nodes and 768 cores. Our tool, together with its reference manual, is freely available at https://github.com/fraguela/ScalaParBiBit.
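The abstract does not spell out the algorithm, so the following is only a minimal serial sketch of how binary biclusters can be found with bitwise operations. It assumes the pattern-based scheme used by BiBit-style methods: each row is packed into a bit word, the bitwise AND of a pair of rows gives a candidate column pattern, and every row that contains that pattern joins the bicluster. The names `encode_row` and `find_biclusters` and the `min_rows`/`min_cols` thresholds are hypothetical and are not taken from ScalaParBiBit; the sketch does not attempt any parallel or distributed execution.

```python
from itertools import combinations


def encode_row(bits):
    """Pack a list of 0/1 values into one integer (first column = most significant bit)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def find_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (rows, cols) pairs for every distinct pattern produced by a pair of rows.

    Illustrative assumption: a bicluster is the set of rows whose 1s cover the
    pattern obtained by AND-ing two rows, restricted to the pattern's columns.
    """
    n_cols = len(matrix[0])
    encoded = [encode_row(row) for row in matrix]
    seen = set()      # patterns already processed, to avoid duplicate biclusters
    result = []
    for i, j in combinations(range(len(encoded)), 2):
        pattern = encoded[i] & encoded[j]
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it has a 1 in every pattern column.
        rows = [r for r, e in enumerate(encoded) if (e & pattern) == pattern]
        if len(rows) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> (n_cols - 1 - c)) & 1]
            result.append((rows, cols))
    return result


if __name__ == "__main__":
    data = [
        [1, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ]
    for rows, cols in find_biclusters(data):
        print("rows", rows, "cols", cols)
```

Packing rows into machine words is what makes binary data cheap to process: one AND and one comparison test many columns at once. The independent row pairs are also the natural unit of work that a parallel tool could distribute among processes.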

Data availability

The application developed in this manuscript, its building and usage instructions, and the datasets used in the experiments are publicly available under an open-source license at https://github.com/fraguela/ScalaParBiBit.

Acknowledgements

This research was supported by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). We also acknowledge the support from the Centro Singular de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014-2020 Program), by grant ED431G 2019/01. We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the usage of their resources.

Author information

Authors and Affiliations

Universidade da Coruña, CITIC, Grupo de Arquitectura de Computadores, Facultade de Informática, Campus de Elviña, S/N. 15071, A Coruña, Spain
Basilio B. Fraguela, Diego Andrade & Jorge González-Domínguez

Corresponding author

Correspondence to Basilio B. Fraguela.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fraguela, B.B., Andrade, D. & González-Domínguez, J. ScalaParBiBit: scaling the binary biclustering in distributed-memory systems. Cluster Comput 24, 2249–2268 (2021). https://doi.org/10.1007/s10586-021-03261-z

Received: 31 July 2020
Revised: 18 February 2021
Accepted: 06 March 2021
Published: 19 March 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10586-021-03261-z
