research-article

ShapeVis: High-dimensional Data Visualization at Scale

Authors:

Balaji KrishnamurthyAuthors Info & Claims

WWW '20: Proceedings of The Web Conference 2020

Pages 2920 - 2926

https://doi.org/10.1145/3366423.3380058

Published: 20 April 2020 Publication History

Abstract

We present ShapeVis, a scalable visualization technique for point cloud data inspired from topological data analysis. Our method captures the underlying geometric and topological structure of the data in a compressed graphical representation. Much success has been reported by the data visualization technique Mapper, that discreetly approximates the Reeb graph of a filter function on the data. However, when using standard dimensionality reduction algorithms as the filter function, Mapper suffers from considerable computational cost. This makes it difficult to scale to high-dimensional data. Our proposed technique relies on finding a subset of points called landmarks along the data manifold to construct a weighted witness-graph over it. This graph captures the structural characteristics of the point cloud, and its weights are determined using a Finite Markov Chain. We further compress this graph by applying induced maps from standard community detection algorithms. Using techniques borrowed from manifold tearing, we prune and reinstate edges in the induced graph based on their modularity to summarize the shape of data. We empirically demonstrate how our technique captures the structural characteristics of real and synthetic data sets. Further, we compare our approach with Mapper using various filter functions like t-SNE, UMAP, LargeVis and show that our algorithm scales to millions of data points while preserving the quality of data visualization.

References

[1]

Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15, 6 (2003), 1373–1396.

Digital Library

[2]

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8(2013), 1798–1828.

Digital Library

[3]

Silvia Biasotti, Daniela Giorgi, Michela Spagnuolo, and Bianca Falcidieno. 2008. Reeb graphs for shape analysis and applications. Theoretical computer science 392, 1-3 (2008), 5–22.

[4]

Mickaël Buchet, Yasuaki Hiraoka, and Ippei Obayashi. 2018. Persistent homology and materials informatics. In Nanoinformatics. Springer, Singapore, 75–95.

[5]

Pablo G Cámara. 2017. Topological methods for genomics: present and future directions. Current opinion in systems biology 1 (2017), 95–101.

[6]

Pablo G Camara, Daniel IS Rosenbloom, Kevin J Emmett, Arnold J Levine, and Raul Rabadan. 2016. Topological data analysis generates high-resolution, genome-wide maps of human recombination. Cell systems 3, 1 (2016), 83–94.

[7]

Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. 2008. On the local behavior of spaces of natural images. International journal of computer vision 76, 1 (2008), 1–12.

Digital Library

[8]

Mathieu Carriere, Bertrand Michel, and Steve Oudot. 2018. Statistical analysis and parameter selection for Mapper. The Journal of Machine Learning Research 19, 1 (2018), 478–516.

Digital Library

[9]

Corrie J Carstens and Kathy J Horadam. 2013. Persistent homology of collaboration networks. Mathematical problems in engineering 2013 (2013).

[10]

Anne Collins, Afra Zomorodian, Gunnar Carlsson, and Leonidas J Guibas. 2004. A barcode shape descriptor for curve point cloud data. Computers & Graphics 28, 6 (2004), 881–894.

Digital Library

[11]

Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. Delaunay triangulations: height interpolation. Computational Geometry: Algorithms and Applications (2008), 191–218.

[12]

Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. 2011. Generalized louvain method for community detection in large networks. In 2011 11th International Conference on Intelligent Systems Design and Applications. IEEE, 88–93.

[13]

Vin de Silva and Gunnar Carlsson. 2004. Topological estimation using witness complexes. In Proceedings of the First Eurographics conference on Point-Based Graphics. Eurographics Association, 157–166. https://doi.org/10.2312/SPBG/SPBG04/157-166

Digital Library

[14]

Vin de Silva and Robert Ghrist. 2007. Coverage in sensor networks via persistent homology. Algebr. Geom. Topol. 7, 1 (2007), 339–358. https://doi.org/10.2140/agt.2007.7.339

[15]

Bishal Deb, Ankita Sarkar, Nupur Kumari, Akash Rupela, Piyush Gupta, and Balaji Krishnamurthy. 2018. Multimapper: Data Density Sensitive Topological Visualization. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 1054–1061.

[16]

Marian Gidea. 2017. Topological data analysis of critical transitions in financial networks. In International Conference and School on Network Science. Springer, 47–59.

[17]

Leonidas J Guibas and Steve Y Oudot. 2008. Reconstruction using witness complexes. Discrete & computational geometry 40, 3 (2008), 325–356.

[18]

Allen Hatcher. 2002. Algebraic topology. Cambridge University Press, Cambridge. xii+544 pages.

[19]

Geoffrey E Hinton and Sam T Roweis. 2003. Stochastic neighbor embedding. In Advances in neural information processing systems. 857–864.

[20]

Violeta Kovacev-Nikolic, Peter Bubenik, Dragan Nikolić, and Giseon Heo. 2016. Using persistent homology and dynamical distances to analyze protein binding. Statistical applications in genetics and molecular biology 15, 1(2016), 19–38.

[21]

Johannes F Kruiger, Paulo E Rauber, Rafael M Martins, Andreas Kerren, Stephen Kobourov, and Alexandru C Telea. 2017. Graph Layouts by t-SNE. In Computer Graphics Forum, Vol. 36. Wiley Online Library, 283–294.

[22]

Joseph B Kruskal and Myron Wish. 1978. Multidimensional scaling. Vol. 11. Sage.

[23]

Yan Lecun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition.

[24]

Hyekyoung Lee, Moo K Chung, Hyejin Kang, Bung-Nyun Kim, and Dong Soo Lee. 2011. Discriminative persistent homology of brain networks. In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 841–844.

[25]

John Aldo Lee and Michel Verleysen. 2005. Nonlinear dimensionality reduction of data manifolds with essential loops. Neurocomputing 67(2005), 29–53.

Digital Library

[26]

Yongjin Lee, Senja D Barthel, Paweł Dłotko, Seyed Mohamad Moosavi, Kathryn Hess, and Berend Smit. 2018. High-throughput screening approach for nanoporous materials genome using topological data analysis: application to zeolites. Journal of chemical theory and computation 14, 8 (2018), 4427–4437.

[27]

Ying-Ke Lei, Zhu-Hong You, Tianbao Dong, Yun-Xiao Jiang, and Jun-An Yang. 2013. Increasing reliability of protein interactome by fast manifold embedding. Pattern Recognition Letters 34, 4 (2013), 372–379.

Digital Library

[28]

David Letscher and Jason Fritts. 2007. Image segmentation using topological persistence. In International Conference on Computer Analysis of Images and Patterns. Springer, 587–595.

[29]

Dong Liang, Chen Qiao, and Zongben Xu. 2015. Enhancing both efficiency and representational capability of isomap by extensive landmark selection. Mathematical Problems in Engineering 2015 (2015).

[30]

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.

[31]

Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426(2018).

[32]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.

[33]

James Munkres. 2014. Topology. Pearson Education.

[34]

Monica Nicolau, Arnold J Levine, and Gunnar Carlsson. 2011. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences 108, 17(2011), 7265–7270.

[35]

Alice Patania, Giovanni Petri, and Francesco Vaccarino. 2017. The shape of collaborations. EPJ Data Science 6, 1 (2017), 18.

[36]

Nicola Pezzotti, Thomas Höllt, B Lelieveldt, Elmar Eisemann, and Anna Vilanova. 2016. Hierarchical stochastic neighbor embedding. In Computer Graphics Forum, Vol. 35. Wiley Online Library, 21–30.

[37]

Ashutosh Saxena, Abhinav Gupta, and Amitabha Mukerjee. 2004. Non-linear dimensionality reduction by locally linear isomaps. In International Conference on Neural Information Processing. Springer, 1038–1043.

[38]

Michael Sedlmair, Tamara Munzner, and Melanie Tory. 2013. Empirical Guidance on Scatterplot and Dimension Reduction Technique Choices.IEEE Trans. Vis. Comput. Graph. 19, 12 (2013), 2634–2643. http://dblp.uni-trier.de/db/journals/tvcg/tvcg19.html#SedlmairMT13

Digital Library

[39]

Lee M Seversky, Shelby Davis, and Matthew Berger. 2016. On time-series topological data analysis: New data and opportunities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 59–67.

[40]

Hao Shi, Baoqun Yin, Yu Kang, Chao Shao, and Jie Gui. 2017. Robust l-Isomap with a novel landmark selection method. Mathematical Problems in Engineering 2017 (2017).

[41]

Vin D Silva and Joshua B Tenenbaum. 2003. Global versus local methods in nonlinear dimensionality reduction. In Advances in neural information processing systems. 721–728.

[42]

Gurjeet Singh, Facundo Mémoli, and Gunnar E Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In SPBG. 91–100.

[43]

Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web. International World Wide Web Conferences Steering Committee, 287–297.

Digital Library

[44]

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web. International World Wide Web Conferences Steering Committee, 1067–1077.

Digital Library

[45]

Vincent A Traag, Ludo Waltman, and Nees Jan van Eck. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports 9(2019).

[46]

Laurens Van Der Maaten. 2014. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research 15, 1 (2014), 3221–3245.

[47]

Martin Wattenberg, Fernanda Viégas, and Ian Johnson. 2016. How to Use t-SNE Effectively. Distill (2016). https://doi.org/10.23915/distill.00002

[48]

Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.

[49]

Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:cs.LG/cs.LG/1708.07747

[50]

Jaewon Yang and Jure Leskovec. 2015. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42, 1 (2015), 181–213.

Digital Library

Cited By

Minch BNowak MWcisło RDzwinel W(2020)GPU-Embedding of kNN-Graph Representing Large and High-Dimensional DataComputational Science – ICCS 202010.1007/978-3-030-50417-5_24(322-336)Online publication date: 15-Jun-2020
https://doi.org/10.1007/978-3-030-50417-5_24

Index Terms

ShapeVis: High-dimensional Data Visualization at Scale

Index terms have been assigned to the content through auto-classification.

Recommendations

Visualizing Large-scale and High-dimensional Data
WWW '16: Proceedings of the 25th International Conference on World Wide Web

We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then ...
Supervised nonlinear dimensionality reduction for visualization and classification

When performing visualization and classification, people often confront the problem of dimensionality reduction. Isomap is one of the most promising nonlinear dimensionality reduction techniques. However, when Isomap is applied to real-world data, it ...
k-means discriminant maps for data visualization and classification
SAC '08: Proceedings of the 2008 ACM symposium on Applied computing

Over the years, many dimensionality reduction algorithms have been proposed for learning the structure of high dimensional data by linearly or non-linearly transforming it into a low-dimensional space. Some techniques can keep the local structure of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WWW '20: Proceedings of The Web Conference 2020

April 2020

3143 pages

ISBN:9781450370233

DOI:10.1145/3366423

Editors:
Yennun Huang
Acadmica sinica, Taiwan
,
Irwin King
The Chinese University of Hong Kong, Hong Kong
,
Tie-Yan Liu
Microsoft Research Asia, China
,
Maarten van Steen
University of Twente, Netherlands

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 April 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '20

Sponsor:

SIGWEB

WWW '20: The Web Conference 2020

April 20 - 24, 2020

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
293
Total Downloads

Downloads (Last 12 months)24
Downloads (Last 6 weeks)3

Reflects downloads up to 16 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Minch BNowak MWcisło RDzwinel W(2020)GPU-Embedding of kNN-Graph Representing Large and High-Dimensional DataComputational Science – ICCS 202010.1007/978-3-030-50417-5_24(322-336)Online publication date: 15-Jun-2020
https://doi.org/10.1007/978-3-030-50417-5_24

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Media

Figures

Other

Tables

View Table of Contents