Article
Published: 03 December 2019

Visualizing structure and transitions in high-dimensional biological data

Nature Biotechnology volume 37, pages 1482–1492 (2019)

An Author Correction to this article was published on 02 January 2020

This article has been updated

Abstract

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.


**Fig. 1: Overview of PHATE and its ability to reveal structure in data.**

**Fig. 3: Extracting branches and branchpoints from PHATE.**

**Fig. 4: PHATE most accurately represents manifold distances in a 2D embedding.**

**Fig. 5: Comparison of PHATE to other visualization methods on biological datasets.**

**Fig. 6: PHATE analysis of embryoid body scRNA-seq data with n = 16,825 cells.**


Data availability

The embryoid body scRNA-seq and bulk RNA-seq datasets generated and analyzed during the current study are available from the Mendeley Data repository at https://doi.org/10.17632/v6n743h5ng.1. Supplementary Fig. 14a contains images of the raw single cells, while Supplementary Fig. 14f contains scatter plots showing the gating procedure for the fluorescence-activated cell sorting (FACS) populations used for the bulk RNA-seq data.

Code availability

Python, R and Matlab implementations of PHATE are available on GitHub (https://github.com/KrishnaswamyLab/PHATE) for academic use.

Change history

02 January 2020
An amendment to this paper has been published and can be accessed via a link at the top of the paper.


Acknowledgements

This research was supported in part by the Gruber Foundation (to S.G.); the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health (NIH) (award number F31HD097958) (to D.B.B.); an Alfred P. Sloan Fellowship (grant FG-2016-6607); a DARPA Young Faculty Award (grant D16AP00117); National Science Foundation (NSF) grants 1620216, 1912906; an NSF CAREER award (grant 1845856) (to M.J.H.); NIH grant 1R01HG008383-01A1 (to R.R.C.); NIH grant R01GM107092 (to N.B.I.); IVADO (Institut de valorisation des données) (to G.W.); the Chan–Zuckerberg Initiative (grant 182702); NIH grant R01GM130847; the State of Connecticut (grant 16-RMB-YALE-07) (to S.K.); and NIH grant R01GM135929 (to M.J.H., G.W. and S.K.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Author information

These authors contributed equally: Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante.
These authors jointly supervised this work: Natalia B. Ivanova, Guy Wolf, Smita Krishnaswamy.

Authors and Affiliations

Department of Mathematics and Statistics, Utah State University, Logan, UT, USA
Kevin R. Moon
Cardiovascular Research Center, section Cardiology, Department of Internal Medicine, Yale University, New Haven, CT, USA
David van Dijk
Department of Computer Science, Yale University, New Haven, CT, USA
David van Dijk & Smita Krishnaswamy
School of Basic Medicine, Qingdao University, Qingdao, China
Zheng Wang
Yale Stem Cell Center, Department of Genetics, Yale University, New Haven, CT, USA
Zheng Wang
Computational Biology and Bioinformatics Program, Yale University, New Haven, CT, USA
Scott Gigante
Department of Genetics, Yale University, New Haven, CT, USA
Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen & Smita Krishnaswamy
Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
Matthew J. Hirn
Department of Mathematics, Michigan State University, East Lansing, MI, USA
Matthew J. Hirn
Applied Mathematics Program, Yale University, New Haven, CT, USA
Ronald R. Coifman
Department of Genetics, Center for Molecular Medicine, University of Georgia, Athens, GA, USA
Natalia B. Ivanova
Department of Mathematics and Statistics, Université de Montréal, Montréal, Quebec, Canada
Guy Wolf
Mila—Quebec Artificial Intelligence Institute, Montréal, Quebec, Canada
Guy Wolf

Authors

Kevin R. Moon
David van Dijk
Zheng Wang
Scott Gigante
Daniel B. Burkhardt
William S. Chen
Kristina Yim
Antonia van den Elzen
Matthew J. Hirn
Ronald R. Coifman
Natalia B. Ivanova
Guy Wolf
Smita Krishnaswamy

Contributions

K.R.M., S.K., G.W. and D.v.D. envisioned the project. K.R.M., D.v.D., S.G. and G.W. implemented the method. K.R.M., D.v.D., S.G., S.K. and N.B.I. performed the analyses. K.R.M., S.K., G.W., and N.B.I. wrote the paper. D.v.D., S.G. and D.B.B. assisted in writing. D.B.B., W.S.C. and K.Y. assisted in the analysis. K.R.M., G.W., M.J.H. and R.R.C. developed the mathematical foundations of the method. Z.W., A.v.d.E. and N.B.I. were responsible for data acquisition and processing.

Corresponding authors

Correspondence to Natalia B. Ivanova, Guy Wolf or Smita Krishnaswamy.

Ethics declarations

Competing interests

Smita Krishnaswamy serves on the scientific advisory board of AI Therapeutics.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of PHATE to DM on the artificial tree (n=1440 60-dimensional data points).

(A) PHATE applied to the artificial tree data. Only two PHATE coordinates are needed to separate all branches. (B) The first six diffusion map coordinates of the artificial tree data. At least five of these coordinates are necessary to separate all of the branches.

Supplementary Figure 2 Impact of potential distances and PHATE parameters on the resulting visualization.

(A) Comparison of Diffusion Maps (blue) and PHATE (orange) embeddings on data (black) from a half circle (left, n = 100 data points) and a full circle (right, n = 100 data points). Both the data and the embeddings have been centered about the mean and rescaled by the max Euclidean norm. For the full circle, both embeddings are identical (up to centering and scaling) to the original circle. However, for the half circle, the Diffusion Maps embedding (blue) suffers from instabilities that generate significantly higher densities near the two end points. The PHATE embedding (orange) does not exhibit these instabilities. (B) The α-decaying kernel \(K_{\alpha,\sigma}(x) = \exp\left(-\left(\frac{|x|}{\sigma}\right)^{\alpha}\right)\) as a function of x for different values of α and σ = 1 (left) and σ = 4 (right). As α increases, \(K_{\alpha,\sigma}(x)\) becomes closer to constant for \(x \in (-\sigma, \sigma)\) and the tails of the kernel become lighter (i.e., decay to zero more quickly) for \(x \notin (-\sigma, \sigma)\). (C) Demonstration of the effect of the scale t on the PHATE visualization for the artificial tree data (n = 1440 60-dimensional data points) colored by branch. The first column shows the VNE H(t) (see Eq. 5) of the diffusion affinities as a function of the time scale t. The other columns give the PHATE visualization with different values of t. The red dots in the first column indicate the values of t chosen for the plots. The red dot surrounded by a black box indicates the value of t chosen for the visualization of the artificial tree data in Figure 1B. Values of t that are too low can give noisy visualizations, while very high values of t can result in a loss of information in the visualization. (D) Visualization of scRNA-seq data measured from mouse retinal bipolar neurons (Shekhar et al., Cell, vol. 166, no. 5, pp. 1308-1323, 2016), using different informational distances defined via the parameter γ; n = 27499 single cells.
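The behavior of the α-decaying kernel described in panel (B) can be checked numerically. The following is a minimal sketch that evaluates only the formula given in the caption; the function name `alpha_decay_kernel` and the sample points are illustrative assumptions and are not taken from the PHATE codebase.

```python
import numpy as np


def alpha_decay_kernel(x, sigma=1.0, alpha=2.0):
    """Evaluate K_{alpha,sigma}(x) = exp(-(|x| / sigma) ** alpha).

    Hypothetical helper for illustration; it mirrors the caption's formula only.
    """
    x = np.asarray(x, dtype=float)
    return np.exp(-(np.abs(x) / sigma) ** alpha)


# Sample points inside (-sigma, sigma), at sigma, and outside it (sigma = 1).
xs = np.array([0.0, 0.5, 1.0, 2.0])
for alpha in (2, 10, 40):
    values = alpha_decay_kernel(xs, sigma=1.0, alpha=alpha)
    print(f"alpha={alpha}: {np.round(values, 4)}")
```

With larger α, the values inside (−σ, σ) move toward 1 and the values outside move toward 0, which matches the flatter top and lighter tails described in the caption. At |x| = σ the kernel equals exp(−1) for every α.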

Supplementary Figure 3 Comparison of PHATE to various methods on multiple artificial and non-biological datasets.

Note that methods with strong structural assumptions on the data, such as t-SNE (clusters) and Monocle2 (tree) are expected to fail on the subset of datasets which do not fit their assumptions. See Supplementary Note 2 for discussion. See the figure for the respective sample sizes for each dataset.

Supplementary Figure 4 Visual and quantitative demonstrations of the robustness of PHATE to subsampling and the choice of parameters.

(A) The PHATE visualization for the iPSC mass cytometry dataset from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) with varying subsample sizes N. The main branches present for N = 10000 are also visible for the other values of N, demonstrating that the PHATE embedding is robust to the size of the subsample. (B) The PHATE visualization of the same iPSC CyTOF dataset (\(n = 50000\) cells) with varying scale parameter t. The embeddings for all t preserve the branching structure and the visualizations are very similar to each other, demonstrating that the embedding is robust to the choice of t. (C) Heatmap of the Spearman correlation coefficient between geodesic distances of the ground truth data and the Euclidean distances of the PHATE visualization applied to the simulated paths dataset using Splatter (Zappia et al., Genome Biology, vol. 18, no. 1, p. 174, 2017). The results are presented using different values for k, t, and α. The value of t selected using the kneepoint method in this case is 8. The number of simulated cells is n = 3000. (D) Heatmap of the Spearman correlation coefficient between geodesic distances of the ground truth data and the Euclidean distances of the PHATE visualization applied to the simulated groups dataset using Splatter. The results are presented using different values for k, t, and α. For both the groups and paths datasets, the results are very stable for \(\alpha \ge 10\). The value of t selected using the kneepoint method in this case is 8. The number of simulated cells is n = 3000.
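The quantity shown in the heatmaps of panels (C) and (D), a Spearman correlation between ground-truth distances and Euclidean distances in the embedding, can be sketched as follows. This is an illustrative NumPy-only version under stated assumptions: the ground-truth distance matrix is supplied by the caller (how the geodesic distances were computed is not specified in the caption), and the function names are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 to n, giving tied entries their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def pairwise_euclidean(points):
    """Dense matrix of Euclidean distances between the rows of `points`."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_spearman(true_distances, embedding):
    """Spearman correlation between ground-truth and embedding distances.

    Only the upper triangle is used so that each pair is counted once.
    """
    true_distances = np.asarray(true_distances, dtype=float)
    upper = np.triu_indices(true_distances.shape[0], k=1)
    rank_true = average_ranks(true_distances[upper])
    rank_embed = average_ranks(pairwise_euclidean(embedding)[upper])
    return float(np.corrcoef(rank_true, rank_embed)[0, 1])


# Toy data (an assumption for illustration): points on a 1D path where the
# ground-truth distance is the separation along the path.
rng = np.random.default_rng(0)
position = rng.random(40)
ground_truth = np.abs(position[:, None] - position[None, :])
faithful = np.column_stack([3 * position, np.zeros_like(position)])
scrambled = faithful[rng.permutation(position.size)]
print(distance_spearman(ground_truth, faithful))
print(distance_spearman(ground_truth, scrambled))
```

An embedding that preserves the ordering of pairwise distances scores close to 1, while one that scrambles the points scores much lower. Because the dense distance matrix grows quadratically with the number of cells, a version intended for large datasets would need subsampling or a more memory-efficient distance routine.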

Supplementary Figure 5 Visual and quantitative demonstrations of the reproducibility of PHATE compared to PCA, tSNE, and UMAP.

Reproducibility was computed on 4 different datasets (4 columns) that were generated using Splatter. The different runs had different random seeds and n = 2000 cells. (A) Boxplots show RMSE computed between 10 runs of each method. RMSE was computed between each unique pair of runs (thus 45 in total) after aligning the pair of embeddings with Procrustes. Thus, RMSE here quantifies how much embeddings change between runs, with lower RMSE signifying greater reproducibility. In the boxplots, the box limits indicate the lower and upper quartile values with a line at the median while the whiskers show the range of the data. (B) For each method (rows) and each dataset (columns) two example runs are shown (orange and blue points) to visually demonstrate the reproducibility. In line with the RMSE boxplots, PHATE and PCA show almost perfectly overlapping embeddings while tSNE and UMAP show significant variability between runs.
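The reproducibility measure in panel (A), RMSE between every unique pair of runs after Procrustes alignment, can be sketched as below. This is a hedged illustration and not the authors' code: the alignment here centers both embeddings, scales them to unit Frobenius norm, and solves for the best orthogonal transform and scale, which is one common form of Procrustes analysis; the exact normalization used in the paper is not stated in the caption. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def procrustes_align(reference, other):
    """Center and unit-scale both embeddings, then rotate/scale `other` onto `reference`."""
    ref = np.asarray(reference, dtype=float)
    oth = np.asarray(other, dtype=float)
    ref = ref - ref.mean(axis=0)
    oth = oth - oth.mean(axis=0)
    ref = ref / np.linalg.norm(ref)
    oth = oth / np.linalg.norm(oth)
    # Orthogonal Procrustes: the best transform comes from the SVD of oth^T ref.
    u, singular_values, vt = np.linalg.svd(oth.T @ ref)
    rotation = u @ vt
    return ref, (oth @ rotation) * singular_values.sum()


def rmse(a, b):
    """Root mean squared pointwise distance between two aligned embeddings."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))


def pairwise_run_rmse(runs):
    """RMSE for each unique pair of runs; 10 runs give 45 values."""
    return [rmse(*procrustes_align(a, b)) for a, b in combinations(runs, 2)]


# Toy check (illustrative data): ten "runs" that differ only by small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
runs = [base + 0.01 * rng.normal(size=base.shape) for _ in range(10)]
scores = pairwise_run_rmse(runs)
print(len(scores), max(scores))
```

Lower values mean that two runs produce nearly the same picture once rotation, reflection, translation and overall scale are factored out, which is why the measure suits methods whose output is only defined up to such transformations.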

Supplementary Figure 6 Scalability tests of PHATE.

(A) Scalable PHATE embedding of iPSC CyTOF data (\(n = 220450\) cells) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) with a subset of the landmarks shown in red (200 out of 2000). (B) Robustness of PHATE to the number of landmarks chosen. PHATE on the EB data (\(n = 16825\) cells) computed using increasing numbers of landmarks (X-axis) was compared to exact PHATE, i.e. without landmarks. Comparison was done using Procrustes analysis (optimal linear transformation) and the sum of squared error (SSE, Y-axis) is shown. To ensure a stable embedding that accurately approximates exact PHATE we choose 2000 landmarks as default. The inset shows the histogram of pairwise distances in the visualization computed using fast PHATE (2000 landmarks) on the EB data vs. the pairwise distances from exact PHATE. The correspondence and the Pearson correlation coefficient are very high. (C) PHATE and t-SNE embeddings of a mouse brain cell dataset from 10X genomics with a large number of cells (\(n = 1,300,774\) cells). The PHATE embedding was calculated with 2000 landmarks and completed in three hours. A subset (10 of 60) of the clusters provided by 10X are shown in color, the rest in gray. t-SNE shatters the cluster structure, while PHATE retains clusters as contiguous groups of cells. (D) Runtime of PHATE, t-SNE and UMAP on increasingly large subsamples of the EB data. Runtime was averaged across four runs. (E) Runtime of 12 visualization methods shown in Figures S3 and S8 across 19 datasets and corresponding line of best fit for each method. Where a method ran out of memory or took longer than one hour, the runtime is not shown and linear fits are cut off accordingly.

Supplementary Figure 7 Annotated PHATE visualizations of CyTOF iPSC data (n = 50000 cells) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) and branch expression analysis.

(A) The primary branch point between the two major branches (reprogrammed and refractory) of the data is highlighted. (B) The PHATE visualization colored by Lin28 (a marker associated with the transition to pluripotency (Polo et al., Cell, vol. 151, no. 7, pp. 1617-1632, 2012)) and Ccasp3 (associated with cell apoptosis). Lin28 expression is limited to the reprogrammed branch while Ccasp3 is primarily expressed in the refractory branch, indicating that the failure to reprogram may initiate apoptosis in these cells. (C) Analysis of branches on the PHATE embedding for the same iPSC CyTOF data, (D) bone marrow scRNA-seq dataset (\(n = 2730\) cells) from Paul et al. (Cell, vol. 163, no. 7, pp. 1663-1677, 2015), and (E) newly generated embryoid body scRNA-seq data (\(n = 16825\) cells). (Left) The PHATE visualization with identified branches. (Middle) Expression level for each cell ordered by branch and ordering within the branch. Cell ordering is calculated using Wanderlust (Bendall et al., Cell, vol. 157, no. 3, pp. 714-725, 2014) starting on the left-most point of each branch. Expression levels are z-scored for each gene. A colorbar is given below the expression matrices that identifies each branch and (in the case of the bone marrow scRNA-seq data) cell type. (Right) DREMI scores (Krishnaswamy et al., Science, vol. 346, no. 6213, p. 1250689, 2014) between gene expression levels and cell order within each branch. MAGIC (van Dijk et al., Cell, vol. 174, no. 3, pp. 716-729, 2018) is applied first in (D) and (E) to impute missing values using the same kernel used for PHATE and smaller t. For branch analysis of the bone marrow data in (D), we used 3 PHATE dimensions to obtain clearer branch separation.

Supplementary Figure 8 Comparison of PHATE to various methods on multiple biological datasets.

Supplementary Figure 9 PHATE preserves separations and cluster structure in addition to continuum structure.

To quantify the ability of PHATE to preserve cluster structure, we generated 30 random datasets with cluster structure using the Splatter package (Zappia et al., Genome Biology, vol. 18, no. 1, p. 174, 2017). Each dataset has \(n = 2000\) cells and between 7 and 14 clusters. We then computed the Adjusted Rand Index (Rand, Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971) (ARI, y-axis) between the ground truth clusters and clusters obtained by running k-means clustering on the embeddings. An ARI of 1 means perfect recovery of the clusters. We performed this analysis on Splatter data with increasing amounts of noise added during generation. For each noise level we compare clustering on the raw data, on 2D PCA, 2D t-SNE, 2D UMAP, and 2D PHATE. On average, PHATE preserves local cluster structure as well as or better than the other methods. In the boxplots, the box limits indicate the lower and upper quartile values with a line at the median, while the whiskers show the range of the data.
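For readers unfamiliar with the ARI, the following is a minimal sketch of the standard pair-counting formula, written with the Python standard library only. It is an illustration rather than the code used for this figure. The function name `adjusted_rand_index` is an assumption, and the predicted labels would come from whatever clustering is applied to the embedding (k-means in the caption above).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many items fall in each (true, predicted) pair.
    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of item pairs placed together in both, in truth, and in prediction.
    together_both = sum(comb(c, 2) for c in contingency.values())
    together_true = sum(comb(c, 2) for c in true_sizes.values())
    together_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


# Usage: the score ignores how clusters are named.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

The correction for chance is what distinguishes the ARI from the plain Rand index: a random labeling scores near 0 rather than near the fraction of agreeing pairs.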

Supplementary Figure 10 PHATE reveals structure in a variety of high-dimensional datasets.

(A) A 3D PHATE visualization of the Frey Faces dataset (\(n = 1965\) images) used in Roweis and Saul (Science, vol. 290, no. 5500, pp. 2323-2326, 2000). Points are colored by time within the video. Multiple branches corresponding to different poses are clearly visible. (B) PCA and PHATE embeddings of microbiome data from the American Gut project (\(n = 9660\) human samples), colored by body site, with branches annotated by their dominant genera or phyla. (C) The PHATE embedding of the same data from the American Gut project colored by two genera (Bacteroides and Prevotella) and a phylum (Actinobacteria) of bacteria. (D) The PHATE embedding of only the fecal samples from the American Gut project (\(n = 8596\)) colored by various genera (Bacteroides and Prevotella) and phyla (Firmicutes, Verrucomicrobia, and Proteobacteria) of bacteria. Each PHATE branch is associated with one of these bacterial groups. (E) PCA and PHATE embeddings of SNP data from the Human Origins dataset (\(n = 2345\) present-day humans) showing genotyped present-day humans from 203 populations (Patterson et al., Genetics, vol. 192, no. 3, pp. 1065-1093, 2012) with the population legend in (F).

Supplementary Figure 11 PHATE reveals structure in a variety of connectivity datasets.

(A) 3D PHATE visualization of human Hi-C data (Darrow et al., Proceedings of the National Academy of Sciences, p. 201609643, 2016) using all 23 chromosomes at 50 kb resolution (\(n = 56702\) locations on the chromosomes), colored by chromosome. Each point corresponds to a genomic fragment. (B) PHATE visualizations of the same human Hi-C data as in (A) for chromosome 1 at 10 kb resolution, colored by chromosome location (\(n = 22128\) chromosome locations). (C) 2D PHATE visualization of the same human Hi-C data for chromosome 1 at 10 kb resolution, colored by selected chromatin modification markers from ChIP-seq data (\(n = 22128\) chromosome locations). (D) Force-directed layout and PHATE visualizations of Facebook network data with data points colored by their degree (number of connections). The subnetworks are taken from the friend networks of selected individuals within the entire network. In all cases, PHATE reveals more structure. For the entire network, \(n = 3927\) nodes. For subnetworks 1 and 2, \(n = 1034\) and 532 nodes, respectively.

Supplementary Figure 12 Additional analysis with PHATE on scRNA-seq data measured from mouse retinal bipolar neurons from Shekhar et al. (Cell, vol. 166, no. 5, pp. 1308-1323, 2016).

(A) i. Initial PHATE embedding (\(n = 27499\) cells). The rod bipolar cell cluster (cluster 1) is circled. ii. Subsequent PHATE embedding of cluster 1, colored by k-means clustering to show heterogeneity within rod bipolar cells (\(n = 10889\) cells). (B) Transcriptional characterization of subtypes of rod bipolar cells from cluster 1, using known bipolar cell markers.

Supplementary Figure 13 PHATE using reweighted distances to highlight specific biological processes or “views” of the data.

(A) PHATE embedding of the CyTOF iPSC data (\(n = 220450\) cells) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) using (i) unweighted distances, (ii) distances after upweighting cell cycle markers, (iii) distances after upweighting stem cell markers, (iv) distances after upweighting mitotic markers. (B) PHATE embedding of the same dataset colored by different markers (columns). From top to bottom: (i) PHATE cell cycle “view”, (ii) PHATE stem cell “view”, (iii) PHATE mitotic “view”.
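The caption does not spell out how the markers are upweighted. One simple way to picture such a "view" is a weighted Euclidean distance in which the squared differences of the selected markers are multiplied by a factor greater than 1. The sketch below works under that assumption only. The helper names `upweight` and `weighted_distance`, the marker names and the factor are all hypothetical and are not taken from the paper.

```python
import math


def upweight(marker_names, selected, factor):
    """Per-marker weights: `factor` for selected markers, 1.0 for the rest."""
    return [factor if name in selected else 1.0 for name in marker_names]


def weighted_distance(cell_a, cell_b, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    if not (len(cell_a) == len(cell_b) == len(weights)):
        raise ValueError("cells and weights must have the same length")
    return math.sqrt(
        sum(w * (a - b) ** 2 for a, b, w in zip(cell_a, cell_b, weights))
    )


# Usage with hypothetical marker names and measurements.
markers = ["marker_a", "marker_b", "marker_c"]
cell_1 = [0.2, 1.5, 0.7]
cell_2 = [1.1, 1.4, 0.9]

plain = weighted_distance(cell_1, cell_2, upweight(markers, set(), 1.0))
view = weighted_distance(cell_1, cell_2, upweight(markers, {"marker_a"}, 4.0))
print(plain, view)
```

Cells that differ mainly in the upweighted markers move further apart under the reweighted distance, so an embedding built on it emphasizes the corresponding process.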

Supplementary Figure 14 Further analysis of the EB scRNA-seq data.

(A) Inverted images of hESCs and EBs at each timepoint of data collection. Structures of different densities are clearly visible late in the time course (D15-D27) indicating the formation of distinct cell types. The experiments were repeated independently n = 3 times. (B) The PHATE embedding of the EB data (\(n = 16825\) cells) colored by expression levels of selected markers. (C) Heatmap showing gene expression level in each cell in four of the branches starting with ESC. The number of cells in each branch is \(n = 2294\), 9507, 5543, and 4938 for the EN, ME, NE, and NC branches, respectively. Cell ordering is determined using Wanderlust (Bendall et al., Cell, vol. 157, no. 3, pp. 714-725, 2014). Genes were selected either manually or by high DREMI scores (Krishnaswamy et al., Science, vol. 346, no. 6213, p. 1250689, 2014) between gene expression and cell ordering. (D) The PHATE embedding of the EB data (\(n = 16825\) cells) colored by CD49d expression level from the scRNA-seq data (top) and by Spearman correlation between the scRNA-seq transcription factor expression and the CD49d-sorted bulk RNA-seq transcription factor expression per cell (bottom, n = 1213 transcription factors). (E) Same as (D), with CD142 and CD82. The Spearman correlation coefficient is highest in branch vii, which is the branch with the highest CD142 and CD82 expression. Bottom right: Scatter plot of single cell expression levels (\(n = 16825\) cells) between CD82 and CD142. Color corresponds to the Spearman correlation between the scRNA-seq expression and the CD142+CD82+ sorted bulk RNA-seq expression (\(n = 15111\) genes). The branch with highest correlation corresponds to cells that are positive in both CD142 and CD82. (F) Scatter plots showing the gating procedure for FACS sorting cell populations of sub-branch iii (CD49d and CD63) and sub-branch vii (CD82 and CD142). The experiments were repeated independently n = 3 times.
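The per-cell Spearman correlation used in (D) and (E) can be sketched as follows: each cell's expression vector over a set of genes is rank-correlated with the bulk expression vector over the same genes. This is an illustrative standard-library version, not the authors' code. The function names and the toy numbers are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Usage with toy data: rows are cells, columns are genes (hypothetical values).
bulk_expression = [5.0, 0.5, 3.2, 1.1]
single_cells = [
    [4.0, 0.0, 2.0, 1.0],  # same gene ordering as the bulk sample
    [0.0, 3.0, 0.0, 2.0],  # roughly the opposite ordering
]
per_cell_correlation = [spearman(cell, bulk_expression) for cell in single_cells]
print(per_cell_correlation)
```

Because the correlation is rank-based, it is insensitive to the different scales of single-cell and bulk measurements, which is one reason it suits this kind of comparison.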

Supplementary information

Supplementary Materials

Supplementary Figs. 1–14, Supplementary Tables 1–5 and Supplementary Notes 1–4.

Reporting Summary

Supplementary Video 1

The mesoderm branch. Rotating video of 3D PHATE visualizations of the mesoderm branch of the EB scRNA-seq data colored by the geometric mean of selected genes at each stage of the lineage specification tree in Fig. 6b.

Supplementary Video 2

Supplementary Video 3

The neuroectoderm branches. Rotating video of 3D PHATE visualizations of the neuroectoderm branches of the EB scRNA-seq data colored by the geometric mean of selected genes at each stage of the lineage specification tree in Fig. 6b.

Supplementary Video 4

PHATE visualizing the Frey Face dataset. Video showing the PHATE visualization (left) for the Frey Face dataset used by Roweis and Saul⁶ (right). PHATE reveals multiple branches in the data that correspond to different poses. Two of the branches are highlighted in this video. The corresponding point in the PHATE visualization is highlighted as the video progresses.

Supplementary Video 5

PHATE visualizing chromosome 1 in Hi-C data. Rotating 3D PHATE visualization of chromosome 1 in the Hi-C data from Darrow et al.¹⁵ at a resolution of 10 kilobases. Multiple folds are clearly visible in the visualization.

Supplementary Video 6

PHATE visualizing all chromosomes in Hi-C data. Rotating 3D PHATE visualization of all chromosomes in the Hi-C data from Darrow et al.¹⁵ at a resolution of 50 kilobases. The embedding resembles the fractal globule structure proposed in Lieberman-Aiden et al.⁵⁷.

About this article

Cite this article

Moon, K.R., van Dijk, D., Wang, Z. et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 37, 1482–1492 (2019). https://doi.org/10.1038/s41587-019-0336-3

Received: 02 October 2018
Accepted: 29 October 2019
Published: 03 December 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s41587-019-0336-3
