Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis
Figure 1. All possible positions of quartets of four nodes.
Figure 2. Looped rows example.
Figure 3. Binary tree generated using Zgli on the raw sizes of the Basketball data [29]. The shootings are the only type of data clustered together; the rest of the data is clustered linearly.
Figure 4. Binary tree generated using the looped data and bzlib without compression by column over the Basketball data set [29].
Figure 5. Binary tree generated using the looped data and bzlib with compression by column over the Basketball data set [29].
Abstract
1. Introduction
2. Background
2.1. Kolmogorov Complexity
- “ggggggggggggggggg”, with size 17;
- “LATQvgkCQaNwEadqO”, with size 17.
- “return ’g’ * 17”, with 4;
- “return ’LATQvgkCQaNwEadqO’”, with 17.
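Kolmogorov complexity itself is not computable, so in practice a real compressor is used as an upper-bound proxy. The sketch below assumes Python's standard `zlib` module as that compressor (any other would do) and compares the two example strings; `compressed_size` is a name introduced here for illustration only.

```python
import zlib

# The two 17-character example strings from the list above.
regular = b"g" * 17
irregular = b"LATQvgkCQaNwEadqO"


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude proxy for K)."""
    return len(zlib.compress(data, 9))


for label, text in (("regular", regular), ("irregular", irregular)):
    print(label, "raw:", len(text), "compressed:", compressed_size(text))
```

For strings this short the fixed header and checksum of the compressed stream dominate, so the compressed size can exceed the raw size; what matters is the comparison between the two strings.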
2.2. Normalized Compression Distance
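The normalized compression distance is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size and xy is the concatenation of x and y. A minimal sketch follows, assuming `zlib` as the compressor; the function names are illustrative and are not taken from Zgli or CompLearn.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(.): size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox " * 50
other = bytes(range(256)) * 4

print("NCD(text, text): ", round(ncd(text, text), 3))
print("NCD(text, other):", round(ncd(text, other), 3))
```

With a real compressor the distance of an object to itself is small but generally not exactly zero, and values slightly above 1 can occur.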
2.3. The Quartet Method
3. Complearn
- File Mode—takes, as an argument, a filename whose contents will be compressed.
- String Literal Mode—takes, as an argument, a string whose contents will be compressed. By default, string literals are separated by white space; a literal that itself contains white space must be surrounded with double quotes.
- Plain List Mode—takes, as an argument, a filename, which contains a list of filenames to be individually compressed. A line break separates each filename.
- Term List Mode—takes, as an argument, a filename whose contents are a list of string literals to be individually compressed. A line break separates each string literal.
- Directory Mode—takes, as an argument, the name of a directory whose file contents will be used to compute the distance matrix.
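The splitting rules described for the String Literal and Term List modes can be mimicked in a few lines of Python. This is only a sketch of the described behaviour and not CompLearn code: `parse_string_literals` and `parse_term_list` are hypothetical helper names, and skipping blank lines is an assumption.

```python
import shlex


def parse_string_literals(argument: str) -> list:
    """Split on white space, keeping double-quoted literals together."""
    return shlex.split(argument)


def parse_term_list(text: str) -> list:
    """One string literal per line; blank lines are skipped (assumption)."""
    return [line for line in text.splitlines() if line.strip()]


print(parse_string_literals('alpha "beta gamma" delta'))
print(parse_term_list("first term\nsecond term\n"))
```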
4. Zgli
- Folder—this class performs operations inside a folder containing the files intended for compression and clustering.
- Encoder—this class was designed to perform all the operations regarding the tabular encoding of data.
4.1. Folder
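To illustrate the kind of operation such a class performs, the sketch below computes a pairwise NCD matrix over every file in a folder. It is not the Zgli `Folder` API: `folder_distance_matrix` is a hypothetical function, `zlib` is an assumed compressor, and the demo files are created in a temporary directory.

```python
import tempfile
import zlib
from pathlib import Path


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def folder_distance_matrix(folder):
    """Return (file names, NCD matrix) for all regular files in `folder`."""
    paths = sorted(p for p in Path(folder).iterdir() if p.is_file())
    blobs = [p.read_bytes() for p in paths]
    matrix = [[ncd(a, b) for b in blobs] for a in blobs]
    return [p.name for p in paths], matrix


def demo():
    # Three small files: two similar texts and one unrelated binary blob.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_bytes(b"abab" * 300)
        Path(tmp, "b.txt").write_bytes(b"abab" * 290)
        Path(tmp, "c.bin").write_bytes(bytes(range(256)) * 4)
        return folder_distance_matrix(tmp)


names, matrix = demo()
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

The resulting matrix is what a clustering step (a quartet tree or hierarchical clustering, for instance) would take as input.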
4.2. Encoder
- Row—0,1,0,2;
- ASCII string—0123456789abcdefg (...);
- Hop—1.
- 0—000000;
- 1—010101;
- 2—012012.
- 0—000000000000000;
- 1—012012012012012;
- 2—012340123401234.
- 0—000000000000000;
- 1—012012012012012;
- 2—012340123401234.
- 0—000000000000000;
- 1—012340123401234.
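The looped patterns listed above are consistent with a simple rule: a value v is replaced by the first v × hop + 1 characters of the ASCII string, repeated up to a common length. The sketch below implements that inferred rule; it is an assumption reconstructed from the examples, not the actual Zgli Encoder, and choosing the least common multiple of the cycle lengths as the common length is also an assumption. `math.lcm` with several arguments needs Python 3.9 or later.

```python
from math import lcm

ALPHABET = "0123456789abcdefg"  # ASCII string from the example above


def looped_pattern(value, hop=1, length=6, alphabet=ALPHABET):
    """Repeat the first value*hop + 1 symbols of the alphabet up to `length`."""
    cycle_len = value * hop + 1
    if cycle_len > len(alphabet):
        raise ValueError("alphabet too short for this value and hop")
    cycle = alphabet[:cycle_len]
    repeats = -(-length // cycle_len)  # ceiling division
    return (cycle * repeats)[:length]


def encode_row(row, hop=1, alphabet=ALPHABET):
    """Encode every value of a row with a shared pattern length."""
    length = lcm(*(v * hop + 1 for v in set(row)))
    return [looped_pattern(v, hop, length, alphabet) for v in row]


print(encode_row([0, 1, 0, 2], hop=1))
print(encode_row([0, 1, 2], hop=2))
```

A larger hop makes the cycles of neighbouring values differ by more symbols, which is presumably why the hop is exposed as a tunable parameter.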
5. Validation Tests
- Question 1:
- Is it possible to improve the clustering by compression results of tabular data with the new compression by column option?
- Question 2:
- Is it possible to improve on standard clustering results (obtained with the Euclidean distance, for example) for categorical data by using the Zgli encoder and the NCD?
5.1. Question 1—Improving Clustering Results Using Compression by Column
5.2. Question 2—Improving Results by Using Zgli Encoder
6. Clinical Use Case
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Xu, D.; Tian, Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data Sci. 2015, 2, 165–193.
- Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681.
- Henriques, R.; Madeira, S.C. FleBiC: Learning classifiers from high-dimensional biomedical data using discriminative biclusters with non-constant patterns. Pattern Recognit. 2021, 115, 107900.
- Soares, D.F.; Henriques, R.; Gromicho, M.; de Carvalho, M.; Madeira, S.C. Learning prognostic models using a mixture of biclustering and triclustering: Predicting the need for non-Invasive ventilation in Amyotrophic Lateral Sclerosis. J. Biomed. Inform. 2022, 134, 104172.
- Hendricks, R.M.; Khasawneh, M.T. A Systematic Review of Parkinson’s Disease Cluster Analysis Research. Aging Dis. 2021, 12, 1567–1586.
- Molano-González, N.; Rojas, M.; Monsalve, D.M.; Pacheco, Y.; Acosta-Ampudia, Y.; Rodríguez, Y.; Rodríguez-Jimenez, M.; Ramírez-Santana, C.; Anaya, J.M. Cluster analysis of autoimmune rheumatic diseases based on autoantibodies. New insights for polyautoimmunity. J. Autoimmun. 2019, 98, 24–32.
- de Souto, M.C.; Costa, I.G.; de Araujo, D.S.; Ludermir, T.B.; Schliep, A. Clustering cancer gene expression data: A comparative study. BMC Bioinform. 2008, 9, 497.
- Barata, C.; Rodrigues, A.M.; Canhão, H.; Vinga, S.; Carvalho, A. Predicting Biologic Therapy Outcome of Patients With Spondyloarthritis: Joint Models for Longitudinal and Survival Analysis. JMIR Med. Inform. 2021, 9, e26823.
- Rama, K.; Canhão, H.; Carvalho, A.; Vinga, S. AliClu—Temporal sequence alignment for clustering longitudinal clinical data. BMC Med. Inform. Decis. Mak. 2019, 19, 289.
- Gunopulos, D. Cluster and Distance Measure. In Encyclopedia of Database Systems; Liu, L., Ozsu, M.T., Eds.; Springer US: New York, NY, USA, 2009; pp. 374–375.
- Cilibrasi, R.; Vitanyi, P.; Wolf, R. Algorithmic clustering of music. In Proceedings of the Fourth International Conference on Web Delivering of Music, 2004, WEDELMUSIC 2004, IEEE, Barcelona, Spain, 4–14 September 2004; pp. 110–117.
- Wehner, S. Analyzing Worms and Network Traffic Using Compression. J. Comput. Secur. 2007, 15, 303–320.
- Souto, A. Traffic analysis based on compression. In Proceedings of the Conferência sobre Redes de Computadores CRC 15, Évora, Portugal, July 2015; pp. 1–7.
- Resende, J.S.; Sousa, P.R.; Martins, R.; Antunes, L. Breaking MPC implementations through compression. Int. J. Inf. Secur. 2019, 18, 505–518.
- Li, M.; Badger, J.; Chen, X.; Kwong, S.; Kearney, P.; Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17, 149–154.
- Cilibrasi, R.; Vitányi, P. Clustering by compression. IEEE Trans. Inf. Theory 2005, 51, 1523–1545.
- Santos, C.; Bernardes, J.; Vitanyi, P.; Antunes, L. Clustering Fetal Heart Rate Tracings by Compression. In Proceedings of the Computer-Based Medical Systems, 2006, CBMS 2006, 19th IEEE International Symposium on Computer-Based Medical Systems (CBMS’06), Salt Lake City, UT, USA, 22–23 June 2006; pp. 685–690.
- Cebrian, M.; Alfonseca, M.; Ortega, A. The Normalized Compression Distance Is Resistant to Noise. IEEE Trans. Inf. Theory 2007, 53, 1895–1900.
- Cilibrasi, R.; Vitányi, P. Phylogeny of the COVID-19 Virus SARS-CoV-2 by Compression. Entropy 2022, 24, 439.
- Machado, J.A.T.; Rocha-Neves, J.M.; Andrade, J.P. Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov’s complexity and Shannon’s information theories. Nonlinear Dyn. 2020, 101, 1731–1750.
- Azevedo, D.; Souto, A. Import Zgli a Clustering Technique. 2022. Available online: https://zgly-92273.web.app/ (accessed on 25 October 2022).
- TIOBE Software BV. Tiobe Index. Available online: https://www.tiobe.com/tiobe-index/ (accessed on 25 October 2022).
- Developer Nation. What Is the Best Programming Language for Machine Learning? 2019. Available online: https://towardsdatascience.com/what-is-the-best-programming-language-for-machine-learning-a745c156d6b7 (accessed on 25 October 2022).
- Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications, 4th ed.; Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 2019.
- Li, M.; Chen, X.; Li, X.; Ma, B.; Vitanyi, P. The similarity metric. IEEE Trans. Inf. Theory 2004, 50, 3250–3264.
- Cilibrasi, R.; Cruz, A.; Rooij, S. CompLearn. 2008. Available online: https://complearn.org/ (accessed on 18 January 2023).
- Ellson, J.; Gansner, E.; Hu, Y.; North, S.; Jacobsson, M.; Fernandez, M.; Hansen, M.; Alexiev, V.; Bilgin, A.; Caldwell, D.; et al. Graphviz. Available online: https://graphviz.org/ (accessed on 18 January 2023).
- Dua, D.; Graff, C. Iris Dataset, UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 18 January 2023).
- Guarin, D.; Gloria, J.; Naranjo, L. Basketball Dataset, UCI Machine Learning Repository. 2019. Available online: https://archive.ics.uci.edu/ml/datasets/Basketball+dataset (accessed on 18 January 2023).
- Mahmood, F.; Helliwell, P. Ankylosing Spondylitis: A review. EMJ Rheumatol. 2017, 2, 134–139.
- Canhão, H.; Faustino, A.; Martins, F.; Fonseca, J.E.; Rheumatic Diseases Portuguese Register Board Coordination, Portuguese Society of Rheumatology. Reuma.pt - the rheumatic diseases Portuguese register. Acta Reumatol. Port. 2011, 36, 45–56.
- Calin, A.; Garrett, S.; Whitelock, H.; Kennedy, L.G.; O’Hea, J.; Mallone, P.; Jenkinson, T. A new approach to defining functional ability in ankylosing spondylitis: The development of the Bath Ankylosing Spondylitis Functional Index. Class. Pap. Rheumatol. 1994, 21, 2281–2285.
- Machado, P.M.; Landewé, R.; van der Heijde, D. Ankylosing Spondylitis Disease Activity Score (ASDAS): 2018 update of the nomenclature for disease activity states. Ann. Rheum. Dis. 2018, 77, 1539–1540.
- Machado, P.; Landewe, R.; Lie, E.; Kvien, T.K.; Braun, J.; Baker, D.; van der Heijde, D. Ankylosing spondylitis disease activity score (ASDAS): Defining cut-off values for disease activity states and improvement scores. Ann. Rheum. Dis. 2010, 70, 47–53.
- Ramiro, S.; Nikiphorou, E.; Sepriano, A.; Ortolan, A.; Webers, C.; Baraliakos, X.; Landewé, R.B.; Van den Bosch, F.E.; Boteva, B.; Bremander, A.; et al. ASAS-EULAR recommendations for the management of Axial Spondyloarthritis: 2022 update. Ann. Rheum. Dis. 2022, 82, 19–34.
- Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205.
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
- Rosenberg, A.; Hirschberg, J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 410–420.
- 2.3. Clustering. Available online: https://scikit-learn.org/stable/modules/clustering.html#rand-index (accessed on 18 November 2022).
Feature1 | Feature2 | Feature3 | Feature4 |
---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 |
4.9 | 3.0 | 1.4 | 0.2 |
7.0 | 3.2 | 4.7 | 1.4 |
6.4 | 3.2 | 4.5 | 1.5 |
6.3 | 3.3 | 6.0 | 2.5 |
Feature1 | Feature2 | Feature3 | Feature4 |
---|---|---|---|
(4.296, 5.2] | (3.2, 3.8] | (0.994, 2.475] | (0.0976, 0.7] |
(4.296, 5.2] | (2.6, 3.2] | (0.994, 2.475] | (0.0976, 0.7] |
(6.1, 7.0] | (2.6, 3.2] | (3.95, 5.425] | (1.3, 1.9] |
(6.1, 7.0] | (2.6, 3.2] | (3.95, 5.425] | (1.3, 1.9] |
(6.1, 7.0] | (3.2, 3.8] | (5.425, 6.9] | (1.9, 2.5] |
Feature1 | Feature2 | Feature3 | Feature4 |
---|---|---|---|
0 | 2 | 0 | 0 |
0 | 1 | 0 | 0 |
2 | 1 | 2 | 2 |
2 | 1 | 2 | 2 |
2 | 2 | 3 | 3 |
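The step from raw values to ordinal codes shown in the tables above can be reproduced with right-closed bins. In the sketch below the inner bin edges are read off the interval table; treating the bins as contiguous (so that code 1 of Feature1 is (5.2, 6.1], for example) is an assumption, and `ordinal_code` and `encode_table` are illustrative names rather than Zgli functions.

```python
from bisect import bisect_left

# Inner edges of the four right-closed bins of each feature,
# taken from the interval endpoints in the table above.
INNER_EDGES = [
    [5.2, 6.1, 7.0],        # Feature1
    [2.6, 3.2, 3.8],        # Feature2
    [2.475, 3.95, 5.425],   # Feature3
    [0.7, 1.3, 1.9],        # Feature4
]

RAW_ROWS = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
]


def ordinal_code(value, inner_edges):
    """Index of the right-closed bin (lo, hi] that contains `value`."""
    return bisect_left(inner_edges, value)


def encode_table(rows, edges=INNER_EDGES):
    return [[ordinal_code(v, e) for v, e in zip(row, edges)] for row in rows]


for row in encode_table(RAW_ROWS):
    print(row)
```

`bisect_left` places a value that equals an edge in the lower bin, which matches the right-closed intervals: 7.0 falls in (6.1, 7.0] and 3.2 in (2.6, 3.2].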
Feature1 | Feature2 | Feature3 | Feature4 |
---|---|---|---|
000000000000 | 012012012012 | 000000000000 | 000000000000 |
000000000000 | 010101010101 | 000000000000 | 000000000000 |
012012012012 | 010101010101 | 012012012012 | 012012012012 |
012012012012 | 010101010101 | 012012012012 | 012012012012 |
012012012012 | 012012012012 | 012301230123 | 012301230123 |
Compressor | Dataset | Compress by Column Option | Tree Scores |
---|---|---|---|
bzlib | raw | Disabled | 0.968793 |
bzlib | raw | Enabled | 0.982927 |
bzlib | looped | Disabled | 0.968793 |
bzlib | looped | Enabled | 0.982227 |
zlib | raw | Disabled | 0.979618 |
zlib | raw | Enabled | 0.991362 |
zlib | looped | Disabled | 0.979618 |
zlib | looped | Enabled | 0.991362 |
lzma | raw | Disabled | 0.996360 |
lzma | raw | Enabled | 0.991362 |
lzma | looped | Disabled | 0.976434 |
lzma | looped | Enabled | 0.996360 |
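How the compress by column option works internally is not shown in this section. One plausible reading, assumed here purely for illustration, is that each column of the table is compressed on its own and the compressed sizes are summed, instead of compressing the table row by row as a single stream. The sketch uses `zlib`; `size_by_row` and `size_by_column` are hypothetical names.

```python
import zlib


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def size_by_row(rows):
    """Compress the whole table as one row-major text stream."""
    text = "\n".join(",".join(row) for row in rows)
    return compressed_size(text.encode())


def size_by_column(rows):
    """Assumed variant: compress each column separately and sum the sizes."""
    return sum(
        compressed_size("\n".join(column).encode())
        for column in zip(*rows)
    )


table = [
    ("000000000000", "012012012012", "000000000000", "000000000000"),
    ("000000000000", "010101010101", "000000000000", "000000000000"),
    ("012012012012", "010101010101", "012012012012", "012012012012"),
]

print("row-wise size:   ", size_by_row(table))
print("column-wise size:", size_by_column(table))
```

Which variant yields the smaller size depends on the data and on the per-stream overhead of the compressor, so the sketch prints both rather than claiming a winner.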
Compressor | Hop | Agglomerative Average Accuracy | Agglomerative Complete Accuracy | Agglomerative Single Accuracy |
---|---|---|---|---|
bzlib | 1 | 0.133 | 0.333 | 0.200 |
bzlib | 2 | 0.533 | 0.200 | 0.400 |
bzlib | 3 | 0.733 | 0.333 | 0.733 |
bzlib | 4 | 0.733 | 0.067 | 0.200 |
bzlib | 5 | 0.400 | 0.133 | 0.400 |
bzlib | normal | 0.000 | 0.133 | 0.400 |
lzma | 1 | 0.333 | 0.067 | 0.333 |
lzma | 2 | 0.400 | 0.600 | 0.333 |
lzma | 3 | 0.067 | 0.133 | 0.400 |
lzma | 4 | 0.533 | 0.267 | 0.200 |
lzma | 5 | 0.533 | 0.067 | 0.133 |
lzma | normal | 0.600 | 0.400 | 0.400 |
zlib | 1 | 0.600 | 0.267 | 0.200 |
zlib | 2 | 0.067 | 0.133 | 0.467 |
zlib | 3 | 0.400 | 0.133 | 0.267 |
zlib | 4 | 0.200 | 0.133 | 0.400 |
zlib | 5 | 0.400 | 0.133 | 0.467 |
zlib | normal | 0.467 | 0.200 | 0.333 |
Score | Clusters | Features | Compressor | Model |
---|---|---|---|---|
0.588178 | 8 | 2 | N/A | HC complete |
0.583808 | 8 | 2 | N/A | HC average |
0.56795 | 2 | 3 | bzlib by column | HC average |
0.56795 | 2 | 3 | bzlib by column | K-medoids |
0.56795 | 2 | 3 | bzlib by column | HC complete |
0.56795 | 2 | 3 | bzlib by column | HC single |
0.505713 | 8 | 4 | bzlib by column | HC complete |
0.475152 | 8 | 4 | N/A | HC complete |
0.446612 | 7 | 4 | N/A | HC average |
0.441733 | 6 | 4 | N/A | HC single |
Score | Clusters | Features | Compressor | Model |
---|---|---|---|---|
0.588178 | 8 | 2 | N/A | HC complete |
0.505713 | 8 | 4 | bzlib by column | HC complete |
0.502198 | 8 | 4 | bzlib by column | HC single |
0.475152 | 8 | 4 | N/A | HC complete |
0.450261 | 8 | 5 | bzlib by column | HC complete |
0.446612 | 7 | 4 | N/A | HC average |
0.425628 | 8 | 5 | N/A | HC complete |
0.390852 | 3 | 3 | bzlib by column | K-medoids |
0.390682 | 8 | 5 | bzlib by column | HC single |
0.386077 | 8 | 5 | N/A | HC average |
Score | Clusters | Features | Compressor | Model |
---|---|---|---|---|
0.446612 | 7 | 4 | N/A | HC complete |
0.441733 | 6 | 4 | N/A | HC single |
0.432458 | 8 | 4 | N/A | HC average |
0.412846 | 3 | 3 | bzlib by column | HC complete |
0.407364 | 7 | 5 | N/A | HC complete |
0.390852 | 3 | 3 | bzlib by column | K-medoids |
0.390682 | 8 | 5 | bzlib by column | HC single |
0.386953 | 5 | 5 | N/A | HC average |
0.374891 | 3 | 5 | bzlib by column | HC complete |
0.308163 | 3 | 3 | zlib by column | HC average |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Azevedo, D.; Rodrigues, A.M.; Canhão, H.; Carvalho, A.M.; Souto, A. Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis. Sensors 2023, 23, 1219. https://doi.org/10.3390/s23031219