Functional summaries of persistence diagrams

985 Accesses
5 Altmetric
Explore all metrics

Abstract

One of the primary areas of interest in applied algebraic topology is persistent homology, and, more specifically, the persistence diagram. Persistence diagrams have also become objects of interest in topological data analysis. However, persistence diagrams do not naturally lend themselves to statistical goals, such as inferring certain population characteristics, because their complicated structure makes common algebraic operations—such as addition, division, and multiplication—challenging (e.g., the mean might not be unique). To bypass these issues, several functional summaries of persistence diagrams have been proposed in the literature (e.g. landscape and silhouette functions). The problem of analyzing a set of persistence diagrams then becomes the problem of analyzing a set of functions, which is a topic that has been studied for decades in statistics. First, we review the various functional summaries in the literature and propose a unified framework for the functional summaries. Then, we generalize the definition of persistence landscape functions, establish several theoretical properties of the persistence functional summaries, and demonstrate and discuss their performance in the context of classification using simulated prostate cancer histology data, and two-sample hypothesis tests comparing human and monkey fibrin images, after developing a simulation study using a new data generator we call the Pickup Sticks Simulator (STIX).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Multiscale Persistent Functions for Biomolecular Structure Characterization

Article 02 November 2017

On the expectation of a persistence diagram by the persistence weighted kernel

Article 25 July 2019

A random persistence diagram generator

Article 07 October 2022

Notes

Actually, as long as we have a filtration, we can define a persistence diagram.
Some literature considers the lower level set $f^{-1}((-\infty , \lambda ))$. Both definitions are valid and here we use upper level sets because they yield a very straight forward definition when we consider the image data or certain functions estimated from a point cloud, such as kernel density estimates (Wasserman 2006).
Adams et al. (2017) also presented conditions under which persistence images are stable with respect to changes in the corresponding persistence diagrams.
A finitely discrete probability measure Q puts probability mass only finitely many points in ${\mathcal {B}}_F$.
While Fasy et al. (2018) will release a version of the gland generator that is fully verified with a pathologist, the preliminary version of this software that we used for the experiments in this paper will be made available upon request.
R (R Core Team 2017) is used to produce the STIX images, and the thickness t of the segments are set by the lwd plotting option.
The p-Wasserstein distance between persistence diagrams $D_1$ and $D_2$ is $d_p(D_1, D_2) = \left( \inf _{\phi :D_1 \rightarrow D_2}\sum _{d\in D_1}\Vert d - \phi (d)\Vert _p^p\right) ^{1/p}$, where $\phi $ is a bijection between the two persistence diagrams. See Robinson and Turner (2017) for more details.
This is important to ensure that the tests do not consistently incorrectly reject the null hypothesis.

References

Abdollahi, A., Meysamie, A., Sheikhbahaei, S., Ahmadi, A., Tabriz, H.M., Bakhshandeh, M., Hosseinzadeh, H.: Inter/intra-observer reproducibility of Gleason scoring in prostate adenocarcinoma in Iranian pathologists. Urol. J. 9(2), 486–490 (2012)
Google Scholar
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., Ziegelmeier, L.: Persistence images: a stable vector representation of persistent homology. J. Mach. Learn. Res. 18(1), 218–252 (2017)
MathSciNet MATH Google Scholar
Adcock, A., Carlsson, E., Carlsson, G.: The ring of algebraic functions on persistence bar codes. Homol. Homotopy Appl. 18(1), 381–402 (2016)
MathSciNet MATH Google Scholar
Adler, R.J.: On excursion sets, tube formulas and maxima of random fields. Ann. Appl. Probab. 10, 1–74 (2000)
MathSciNet MATH Google Scholar
Adler, R.J., Agami, S.: Modelling persistence diagrams with planar point processes, and revealing topology with bagplots. J. Appl. Comput. Topol. 3(3), 139–183 (2019)
MathSciNet MATH Google Scholar
Adler, R.J., Bobrowski, O., Borman, M.S., Subag, E., Weinberger, S.: Persistent homology for random fields and complexes. In: Borrowing Strength: Theory Powering Applications—A Festschrift for Lawrence D. Brown, Vol. 6, pp. 124–143. Institute of Mathematical Statistics (2010)
Bendich, P., Marron, J.S., Miller, E., Pieloch, A., Skwerer, S.: Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10(1), 198 (2016)
MathSciNet Google Scholar
Biscio, C., Møller, J.: The accumulated persistence function, a new useful functional summary statistic for topological data analysis, with a view to brain artery trees and spatial point process applications. arXiv preprint arXiv:1611.00630 (2016)
Bubenik, P.: Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16(1), 77–102 (2015)
MathSciNet MATH Google Scholar
Campbell, R.A., Overmyer, K.A., Selzman, C.H., Sheridan, B.C., Wolberg, A.S.: Contributions of extravascular and intravascular cells to fibrin network formation, structure, and stability. Blood 114(23), 4886–4896 (2009)
Google Scholar
Carlsson, G.: Topology and data. Bull. Am. Math. Soc. 46(2), 255–308 (2009)
MathSciNet MATH Google Scholar
Carlsson, G., Zomorodian, A., Collins, A., Guibas, L.J.: Persistence barcodes for shapes. Int. J. Shape Model. 11(2), 149–187 (2005)
Center, M.M., Jemal, A., Lortet-Tieulent, J., Ward, E., Ferlay, J., Brawley, O., Bray, F.: International variation in prostate cancer incidence and mortality rates. Eur. Urol. 61(6), 1079–1092 (2012)
Google Scholar
Chazal, F., Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L.: Stochastic convergence of persistence landscapes and silhouettes. In: Proceedings of the Thirtieth Annual Symposium on Computational Geometry, p. 474. ACM (2014)
Chen, Y.-C., Wang, D., Rinaldo, A., Wasserman, L.: Statistical analysis of persistence intensity functions. arXiv preprint arXiv:1510.02502 (2015)
Ciollaro, M., Genovese, C., Lei, J., Wasserman, L.: The functional mean-shift algorithm for mode hunting and clustering in infinite dimensions. arXiv preprint arXiv:1408.1187 (2014)
Cisewski, J., Croft, R.A., Freeman, P.E., Genovese, C.R., Khandai, N., Ozbek, M., Wasserman, L.: Non-parametric 3D map of the intergalactic medium using the Lyman-alpha forest. Mon. Not. R. Astron. Soc. 440(3), 2599–2609 (2014)
Google Scholar
Cohen-Steiner, D., Edelsbrunner, H., Harer, J.: Stability of persistence diagrams. Discrete Comput. Geom. 37(1), 103–120 (2007)
MathSciNet MATH Google Scholar
Crawford, L., Monod, A., Chen, A.X., Mukherjee, S., Rabadán, R.: Functional data analysis using a topological summary statistic: the smooth Euler characteristic transform. arXiv preprint arXiv:1611.06818 (2016a)
Crawford, L., Monod, A., Chen, A.X., Mukherjee, S., Rabadán, R.: Topological summaries of tumor images improve prediction of disease free survival in glioblastoma multiforme. arXiv preprint arXiv:1611.06818 (2016b)
Degras, D.A.: Simultaneous confidence bands for nonparametric regression with functional data. Stat. Sin. 21, 1735–1765 (2011)
MathSciNet MATH Google Scholar
Edelsbrunner, H., Harer, J.: Persistent homology—a survey. Contemp. Math. 453, 257–282 (2008)
MathSciNet MATH Google Scholar
Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological persistence and simplification. Discrete Comput. Geom. 28(4), 511–533 (2002)
MathSciNet MATH Google Scholar
Edelsbrunner, H., Morozov, D.: Persistent homology: theory and practice. In: Proceedings of the European Congress of Mathematics, pp. 31–50 (2012)
Engers, R.: Reproducibility and reliability of tumor grading in urological neoplasms. World J. Urol. 25(6), 595–605 (2007)
Google Scholar
Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., Srigley, J.R., Humphrey, P.A., Committee, G.: The 2014 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system. Am. J. Surg. Pathol. 40(2), 244–252 (2016)
Google Scholar
Evans, S.M., Patabendi Bandarage, V., Kronborg, C., Earnest, A., Millar, J., Clouston, D.: Gleason group concordance between biopsy and radical prostatectomy specimens: a cohort study from Prostate Cancer Outcome Registry-Victoria. Prostate Int. 4(4), 145–151 (2016)
Google Scholar
Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S., Singh, A.: Confidence sets for persistence diagrams. Ann. Stat. 42(6), 2301–2339 (2014)
MathSciNet MATH Google Scholar
Fasy, B.T., Payne, S., Schenfish, A., Schupback, J., Stouffer, N.: Simulating prostate cancer slide scans (2018) (forthcoming)
Ferlay, J., Soerjomataram, I., Dikshit, R., Eser, S., Mathers, C., Rebelo, M., Parkin, D.M., Forman, D., Bray, F.: Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136(5), E359–E386 (2015)
Google Scholar
Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York (2006)
MATH Google Scholar
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001)
MATH Google Scholar
Gameiro, M., Mischaikow, K., Kalies, W.: Topological characterization of spatial-temporal chaos. Phys. Rev. E 70(3), 035203 (2004)
MathSciNet Google Scholar
Ghrist, R.: Barcodes: the persistent topology of data. Bull. Am. Math. Soc. 45(1), 61–75 (2008)
MathSciNet MATH Google Scholar
Ghrist, R.W.: Elementary Applied Topology. Createspace, Seattle (2014)
MATH Google Scholar
Goodman, M., Ward, K.C., Osunkoya, A.O., Datta, M.W., Luthringer, D., Young, A.N., Marks, K., Cohen, V., Kennedy, J.C., Haber, M.J., Amin, M.B.: Frequency and determinants of disagreement and error in gleason scores: a population-based study of prostate cancer. Prostate 72(13), 1389–1398 (2012)
Google Scholar
Helpap, B., Kristiansen, G., Beer, M., Köllermann, J., Oehler, U., Pogrebniak, A., Fellbaum, C.: Improving the reproducibility of the gleason scores in small foci of prostate cancer—suggestion of diagnostic criteria for glandular fusion. Pathol. Oncol. Res. 18(3), 615–621 (2012)
Google Scholar
Humphrey, P.A.: Gleason grading and prognostic factors in carcinoma of the prostate. Mod. Pathol. 17(3), 292–306 (2004)
MathSciNet Google Scholar
Ieva, F., Paganoni, A., Pigoli, D., Vitelli, V.: Multivariate functional clustering for the analysis of ECG curves morphology. In: Cladag 2011 (8th International Meeting of the Classification and Data Analysis Group), pp. 1–4 (2011)
Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2014)
MathSciNet MATH Google Scholar
Kerber, M., Morozov, D., Nigmetov, A.: Geometry helps to compare persistence diagrams. J. Exp. Algorithm. JEA 22(1), 1–4 (2017)
MathSciNet MATH Google Scholar
Khasawneh, F.A., Munch, E.: Exploring equilibria in stochastic delay differential equations using persistent homology. In: ASME 2014 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pp. V008T11A034–V008T11A034. American Society of Mechanical Engineers (2014)
Khasawneh, F.A., Munch, E.: Chatter detection in turning using persistent homology. Mech. Syst. Signal Process. 70, 527–541 (2016)
Google Scholar
Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference. Springer, New York (2007)
MATH Google Scholar
Lamprecht, M.R., Sabatini, D.M., Carpenter, A.E.: CellProfiler: free, versatile software for automated biological image analysis. Biotechniques 42(1), 71–75 (2007)
Google Scholar
Lawson, P., Berry, E., Brown, J.Q., Fasy, B.T., Wenk, C.: Topological descriptors for quantitative prostate cancer morphology analysis. In: Conference on Digital Pathology, Part of SPIE Medical Imaging. Honorable Mention Poster Award (2017)
Lawson, P., Sholl, A.B., Brown, J.Q., Fasy, B.T., Wenk, C. Persistent Homology for the quantitative evaluation of architectural features in prostate cancer histology. Sci. Rep. 9(1), 1–15 (2019)
Lei, J., Rinaldo, A., Wasserman, L.: A conformal prediction approach to explore functional data. Ann. Math. Artif. Intell. 74(1–2), 29–43 (2015)
MathSciNet MATH Google Scholar
Li, B., Yu, Q.: Classification of functional data: a segmentation approach. Comput. Stat. Data Anal. 52(10), 4790–4800 (2008)
MathSciNet MATH Google Scholar
Ma, S., Yang, L., Carroll, R.J.: A simultaneous confidence band for sparse longitudinal regression. Stat. Sin. 22, 95–122 (2012)
MathSciNet MATH Google Scholar
Mileyko, Y., Mukherjee, S., Harer, J.: Probability measures on the space of persistence diagrams. Inverse Prob. 27(12), 124007 (2011)
MathSciNet MATH Google Scholar
Monod, A., Kališnik, S., Patiño-Galindo, J.Á., Crawford, L.: Tropical sufficient statistics for persistent homology with a parametric application to infectious viral disease. SIAM J. Appl. Algebra Geom. 3(2), 337–371 (2019)
MathSciNet MATH Google Scholar
Munkres, J.R.: Algebraic Topology. Prentice Hall, Upper Saddle River (1964)
MATH Google Scholar
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H., Hahn, U.: Global envelope tests for spatial processes. J. R. Stat. Soc. Ser. B Stat. Methodol. 79(2), 381–404 (2017)
MathSciNet MATH Google Scholar
Obayashi, I., Hiraoka, Y., Kimura, M.: Persistence diagrams with linear machine learning models. J. Appl. Comput. Topol. 1(3–4), 421–449 (2018)
MathSciNet MATH Google Scholar
Padellini, T., Brutti, P.: (2017) Persistence flamelets: Multiscale persistent homology for kernel density exploration. arXiv preprint arXiv:1709.07097
Perea, J.A., Harer, J.: Sliding windows and persistence: an application of topological methods to signal analysis. Found. Comput. Math. 15(3), 799–838 (2015)
MathSciNet MATH Google Scholar
Pretorius, E., Vieira, W., Oberholzer, H., Auer, R.: Comparative scanning electron microscopy of platelets and fibrin networks of human and differents animals. Int. J. Morphol. 27(1), 69–76 (2009)
Google Scholar
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2017)
Google Scholar
Ramsay, J.O.: Functional Data Analysis. Wiley, New York (2006)
Google Scholar
Robins, V.: Computational topology for point data: betti numbers of $\alpha $-shapes. In: Mecke, K.R., Stoyan, D. (eds.) Morphology of Condensed Matter, pp. 261–274. Springer, New York (2002)
Google Scholar
Robinson, A., Turner, K.: Hypothesis testing for topological data analysis. J. Appl. Comput. Topol. 1(2), 241–261 (2017)
MathSciNet MATH Google Scholar
Rossi, F., Villa, N.: Support vector machine for functional data classification. Neurocomputing 69(7–9), 730–742 (2006)
Google Scholar
Rubin, H.: Uniform convergence of random functions with applications to statistics. Ann. Math. Stat. 27(1), 200–203 (1956)
MathSciNet MATH Google Scholar
Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9(Mar), 371–421 (2008)
MathSciNet MATH Google Scholar
Singh, N., Couture, H.D., Marron, J.S., Perou, C., Niethammer, M.: Topological descriptors of histology images. In: International Workshop on Machine Learning in Medical Imaging, pp. 231–239. Springer (2014)
Sousbie, T., Pichon, C., Kawahara, H.: The persistent cosmic web and its filamentary structure–II. Illustrations. Mon. Not. R. Astron. Soc. 414(1), 384–403 (2011)
Google Scholar
Tarpey, T., Kinateder, K.K.: Clustering functional data. J. Classif. 20(1), 093–114 (2003)
MathSciNet MATH Google Scholar
Topaz, C.M., Ziegelmeier, L., Halverson, T.: Topological data analysis of biological aggregation models. PLoS ONE 10(5), e0126383 (2015)
Google Scholar
Truesdale, M.D., Cheetham, P.J., Turk, A.T., Sartori, S., Hruby, G.W., Dinneen, E.P., Benson, M.C., Badani, K.K.: Gleason score concordance on biopsy-confirmed prostate cancer: is pathological re-evaluation necessary prior to radical prostatectomy? BJU Int. 107(5), 749–754 (2011)
Google Scholar
Truong, M., Slezak, J.A., Lin, C.P., Iremashvili, V., Sado, M., Razmaria, A.A., Leverson, G., Soloway, M.S., Eggener, S.E., Abel, E.J., Downs, T.M., Jarrard, D.F.: Development and multi-institutional validation of an upgrading risk tool for Gleason 6 prostate cancer. Cancer 119(22), 3992–4002 (2013)
Google Scholar
Turner, K.: Means and medians of sets of persistence diagrams. arXiv preprint arXiv:1307.8300 (2013)
Turner, K., Mileyko, Y., Mukherjee, S., Harer, J.: Fréchet means for distributions of persistence diagrams. Discrete Comput. Geom. 52(1), 44–70 (2014)
MathSciNet MATH Google Scholar
Turner, K., Mukherjee, S., Boyer, D.M.: Persistent homology transform for modeling shapes and surfaces. Inf. Inference J. IMA 3(4), 310–344 (2014)
MathSciNet MATH Google Scholar
Van de Weygaert, R., Vegter, G., Edelsbrunner, H., Jones, B.J., Pranav, P., Park, C., Hellwing, W.A., Eldering, B., Kruithof, N., Bos, E., et al.: Alpha, betti and the megaparsec universe: on the topology of the cosmic web. In: Transactions on Computational Science XIV, pp. 60–101. Springer (2011)
Van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics), vol. 3. Cambridge University Press, Cambridge (2000)
Google Scholar
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
MathSciNet Google Scholar
Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)
MATH Google Scholar
Wang, J.-L., Chiou, J.-M., Müller, H.-G.: Review of functional data analysis. Annu. Rev. Stat. Appl. 3, 257–295 (2016)
Google Scholar
Wasserman, L.: All of Nonparametric Statistics. Springer, New York (2006)
MATH Google Scholar
Wasserman, L.: Topological data analysis. Annual Review of Statistics and Its Application 5, 501–532 (2016)
MathSciNet Google Scholar
Worsley, K.J.: The geometry of random images. Chance 9(1), 27–40 (1996)
Google Scholar
Yuan, K.-H.: A theorem on uniform convergence of stochastic functions with applications. J. Multivar. Anal. 62(1), 100–109 (1997)
MathSciNet MATH Google Scholar
Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI, pp. 1953–1959 (2013)
Zomorodian, A.J.: Topology for Computing. Cambridge Monographs, vol. 16. Cambridge University Press, Cambridge (2005)
MATH Google Scholar

Download references

Acknowledgements

Berry acknowledges support from NSF DMS 1812055 and DMS 1945639. Berry and Fasy acknowledge supported by NIH and NSF under Grant Numbers NSF DMS 1664858 and NSF DMS 1557716, as well as the research teams at MSU and Tulane University working on the prostate cancer histology project. Chen is supported by NIH U01 AG016976 and NSF DMS 1810960. Cisewski-Kehe thanks the Yale Center for Research Computing for guidance and use of the research computing infrastructure. Cisewski-Kehe and Fasy acknowledge support from NSF under Grant Numbers DMS 1854220 and 1854336. Fasy is additionally supported by NSF CCF 1618605.

Author information

Authors and Affiliations

Department of Mathematical Sciences, Montana State University, Bozeman, USA
Eric Berry
Department of Statistics, University of Washington, Seattle, USA
Yen-Chi Chen
Department of Statistics and Data Science, Yale University, New Haven, USA
Jessi Cisewski-Kehe
School of Computing and Department of Mathematical Sciences, Montana State University, Bozeman, USA
Brittany Terese Fasy

Authors

Eric Berry
View author publications
You can also search for this author in PubMed Google Scholar
Yen-Chi Chen
View author publications
You can also search for this author in PubMed Google Scholar
Jessi Cisewski-Kehe
View author publications
You can also search for this author in PubMed Google Scholar
Brittany Terese Fasy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Brittany Terese Fasy.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs

The proofs from the propositions of Sect. 3 are presented below.

Proof of Proposition 1

This proof uses ideas from Rubin (1956) and Yuan (1997).

Let $\epsilon >0$ be given and ${\widehat{F}}_n = {\widehat{F}} = \frac{1}{n}\sum _{i=1}^n F_i$. Because ${\mathcal {B}}_F$ is equicontinuous, the collection of differences

$$\begin{aligned} {\mathcal {B}}_{\varDelta } = \{{\widehat{F}}_n - {\bar{F}}:n = 1,2,\ldots \} \end{aligned}$$

is also equicontinuous. Since ${\mathbb {T}}$ is compact, there exists a number $M>0$ and points $t_1,\ldots ,t_M\in {\mathbb {T}}$ such that

$$\begin{aligned} \sup _{\varDelta \in {\mathcal {B}}_{\varDelta }} \min _{j =1,\ldots ,M}|\varDelta (t)-\varDelta (t_j)| < \epsilon /2. \end{aligned}$$

Namely, $t_1,\ldots ,t_M$ forms a $\epsilon /2$-covering of ${\mathcal {B}}_{\varDelta }$.

Let $\varDelta _n(t) = {\widehat{F}}(t)-{\bar{F}}(t)$, and note that $\varDelta _n\in {\mathcal {B}}_{\varDelta }$. Then

$$\begin{aligned} \sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)|&= \sup _{t\in {\mathbb {T}}}|\varDelta _n(t)|\\&\le \sup _{t\in {\mathbb {T}}} \min _{j\in 1,\ldots ,M} |\varDelta _n(t)-\varDelta _n(t_j)| + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|\\&\le \epsilon /2 + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|. \end{aligned}$$

By (5), every $t\in {\mathbb {T}}$ satisfies ${{\mathbb {E}}}|F_i(t)|<\infty $ so by the strong law of large numbers,

$$\begin{aligned} |\varDelta _n(t_j)| \overset{a.s.}{\rightarrow } 0 \end{aligned}$$

for every $j=1,\ldots ,M$ and this implies that for any $\delta >0$, there exists $N>0$ such that

$$\begin{aligned} P\left( \sup _{n\ge N}|\varDelta _n(t_j)|>\frac{\epsilon }{2M}\right) <\frac{\delta }{M} \end{aligned}$$

for every $j=1,\ldots ,M$.

As a result,

$$\begin{aligned} P\left( \sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)|>\epsilon \right)&\le P\left( \epsilon /2 + \sup _{j\in 1,\ldots ,M} |\varDelta _n(t_j)|>\epsilon \right) \\&\le P\left( \sup _{n\ge N} \sup _{j\in 1,\ldots ,M}|\varDelta _n(t_j)|>\frac{\epsilon }{2}\right) \\&\le \sum _{j=1}^M P\left( \sup _{n\ge N}|\varDelta _n(t_j)|>\frac{\epsilon }{2M}\right) \le \delta \end{aligned}$$

and the result follows. $\square $

Proof of Corollary 1

The case of persistence landscapes is just a special case of the generalized landscape so we will only prove the case of persistence intensity and generalized landscape. Note that Chazal et al. (2014) already show that persistence silhouettes with a power-weighted silhouette is 1-Lipschitz so the equicontinuity follows so we ignore this case.

Persistence intensity Given a persistence diagram ${\textsf {D}}$, the persistence intensity is the function

$$\begin{aligned} F(t,s;{\textsf {D}})={\mathbb {F}}({\textsf {D}};t,s) = \frac{1}{\left| {\textsf {D}}\right| }\sum _{j=1}^{\left| {\textsf {D}}\right| }\omega (d_j-b_j) \cdot K\left( \frac{\sqrt{|b_j-t|^2+|d_j-s|^2}}{h}\right) , \end{aligned}$$

Thus, for any $(t_1,s_1)$ and $(t_2,s_2)$,

$$\begin{aligned} \left| F(t_1,s_1;{\textsf {D}}) - F(t_2,s_2;{\textsf {D}}) \right|&= \left| \frac{1}{\left| {\textsf {D}}\right| }\sum _{j=1}^{\left| {\textsf {D}}\right| }\omega (d_j-b_j) \cdot \left\{ K\left( \frac{\sqrt{|b_j-t_1|^2+|d_j-s_1|^2}}{h}\right) \right. \right. \\&\quad \left. \left. - K\left( \frac{\sqrt{|b_j-t_2|^2+|d_j-s_2|^2}}{h}\right) \right\} \right| \\&\le \sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert \times \left| K\left( \frac{\sqrt{|b_j-t_1|^2+|d_j-s_1|^2}}{h}\right) \right. \\&\quad \left. - K\left( \frac{\sqrt{|b_j-t_2|^2+|d_j-s_2|^2}}{h}\right) \right| \\&\le \sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert L_0 \cdot \frac{\sqrt{(t_1-t_2)^2+(s_1-s_2)^2}}{h}, \end{aligned}$$

where $L_0$ is the Lipschitz constant of the kernel function K, i.e., $|K(x)-K(y)|\le L_0|x-y| $. The bound $\sup _{a,b\in {\mathbb {T}}}\Vert \omega (a,b)\Vert L_0/h$ is uniform for all $(t_1,s_1)$ and $(t_2,s_2)$ so the corresponding persistence intensity functions form an equicontinuous family.

Generalized landscape Recall that generalized landscapes use the kth maximum value of $\{{\widetilde{\varLambda }}_j(t; h):j =1,\ldots , |{\textsf {D}}|\}$, where

$$\begin{aligned} {\widetilde{\varLambda }}_j(t; h) = {\left\{ \begin{array}{ll} \frac{y_j}{K_h(0)} K_h\left( {t - x_j}\right) , &{}\quad \text {for } | \frac{t - x_j}{h} | \le 1 \\ 0, &{}\quad \text {otherwise, } \end{array}\right. } \end{aligned}$$

and $y_j =|b_j-d_j|$ is the persistence, $x_j = \frac{b_j+d_j}{2}$, and $K_h(\cdot /h) = \frac{1}{h}K(\cdot /h)$. Therefore, for any $t_1,t_2$,

$$\begin{aligned} |{\widetilde{\varLambda }}_j(t_1; h)-{\widetilde{\varLambda }}_j(t_2; h)|&\le \frac{y_j}{K_h(0)} |K_h\left( {t_1 - x_j}\right) -K_h\left( {t_2 - x_j}\right) |\\&\le \frac{y_j}{K(0)/h} \frac{1}{h}\left| K\left( \frac{t_1 - x_j}{h}\right) -K\left( \frac{t_2 - x_j}{h}\right) \right| \\&\le \frac{\mathsf{diam}({\mathbb {T}})}{K(0)} L_0 \frac{|t_1-t_2|}{h}, \end{aligned}$$

where $L_0$ is the Lipschitz constant of the kernel function K and $\mathsf{diam}({\mathbb {T}})$ is the diameter of ${\mathbb {T}}$. Thus, for a given $h>0$, ${\widetilde{\varLambda }}_j(t; h)$ is Lipschitz with a constant $\frac{\mathsf{diam}({\mathbb {T}}) L_0}{K(0)h} $. Moreover, this analysis applies to every $j=1,\ldots , |{\textsf {D}}|$ so the kth maximum value of $\{{\widetilde{\varLambda }}_j(t; h):j =1,\ldots , |{\textsf {D}}|\}$ is still Lipschitz with the same constant. This implies that the generalized landscape forms an equicontinuous family. $\square $

Proof of Proposition 2

Note that by Eq. (5), the functional summary satisfies

$$\begin{aligned} \sup _{t\in {\mathbb {T}}}\mathsf{Var}(F_i(t)) \le \sup _{t\in {\mathbb {T}}}{{\mathbb {E}}}(F^2_i(t))\le {\bar{U}}^2<\infty . \end{aligned}$$

Pointwise normality The first assertion $\sqrt{n}({\widehat{F}}(t)-{\bar{F}}(t))\rightarrow N(0,\sigma ^2(t))$ follows from the usual central limit theorem because the variance is finite.

Normality of integrated difference To show the normality for the integrated difference, note that

$$\begin{aligned} \sqrt{n}\int ({\widehat{F}}(t)-{\bar{F}}(t))dt&= \sqrt{n}\int \left( \frac{1}{n}\sum _{i=1}^n F_i(t) - {\bar{F}}(t)\right) dt\\&= \sqrt{n} \frac{1}{n}\sum _{i=1}^n \underbrace{\int \left( F_i(t)-{\bar{F}}(t)\right) dt}_{=Y_i}\\&= \sqrt{n} {\bar{Y}}_n, \end{aligned}$$

where ${\bar{Y}}_n = \frac{1}{n}\sum _{i=1}^n Y_i$ and $Y_1,\ldots ,Y_n$ are i.i.d. with mean ${{\mathbb {E}}}(Y_i) = 0$ and variance

$$\begin{aligned} \mathsf{Var}(Y_i )= \int \mathsf{Var}(F_i(t))dt = \int \sigma ^2(t)dt<\infty . \end{aligned}$$

Thus, by the usual central limit theorem again, we obtain the normality.

Convergence to a Gaussian process The proof of convergence toward a Gaussian process involves the concept of a Donsker class; see Section 2.2 of Kosorok (2007) for an introduction. Simply put, given i.i.d. random elements $S_1,\ldots , S_n\in {\mathcal {S}}$ and a collection of mappings $f_t: {\mathcal {S}}\mapsto {\mathbb {R}}$ with an index set $t\in {\mathbb {T}}$, the collection $\{f_t: t\in {\mathbb {T}}\}$ is called Donsker if the average function ${\bar{f}}(t) = \frac{1}{n}\sum _{i=1}^nf_t(S_i)$ converges to a tight Gaussian process in ${\mathcal {L}}_\infty $ metric. The assumption in Eq. (6) along with Theorem 2.5 in Kosorok (2007) implies that the class ${\mathcal {W}}$ is Donsker, which implies that the sample mean functional ${\widehat{F}}(t)$ converges to a Gaussian process. $\square $

Proof of of Proposition 3

Because Eq. (6) implies that the class ${\mathcal {W}}$ is Donsker, this proposition is a well-known result in the empirical process theory. For instance, see the discussion on page 21 of Kosorok (2007). Here we briefly highlight the basic idea.

By Proposition 3 and the continuous mapping theorem [see, e.g., page 16 of Kosorok (2007)],

$$\begin{aligned} \sqrt{n}\sup _{t\in {\mathbb {T}}}|{\widehat{F}}(t)-{\bar{F}}(t)| \overset{D}{\rightarrow } \sup _{t\in {\mathbb {T}}}|{\mathbb {B}}(t)|. \end{aligned}$$

In the case of the bootstrap, using the same proposition but now apply it to the bootstrap version, we obtain

$$\begin{aligned} \sqrt{n}\sup _{t\in {\mathbb {T}}}|{\widehat{F}}^*(t)-{\widehat{F}}(t)| \overset{D|F^{\otimes n}}{\rightarrow } \sup _{t\in {\mathbb {T}}}|{\mathbb {B}}(t)|, \end{aligned}$$

where $\overset{D|F^{\otimes n}}{\rightarrow }$ denotes convergences in distribution given $F_1,\ldots ,F_n$.

Therefore, the bootstrap quantile converges to the corresponding quantile of the original difference. $\square $

Proof of Proposition 4

This proof consists of two parts. In the first part, we prove that ${\widehat{q}}_\gamma \overset{P}{\rightarrow } q_\gamma $. In the second part, we prove the desired result.

Part 1 Given $F_1,\ldots ,F_n$, we define

$$\begin{aligned} {\widehat{Q}}(t) = \frac{1}{n}\sum _{i=1}^n I(d(F_i,{\bar{F}})\le t) \end{aligned}$$

to be the empirical version of Q(t) and define

$$\begin{aligned} {\widetilde{q}}_\gamma : {\widehat{Q}}({\widetilde{q}}_\gamma ) = \gamma , \end{aligned}$$

to be the corresponding $\gamma $ quantile.

Because ${\widehat{Q}}$ is just the empirical distribution function (EDF) of $Y_1,\ldots ,Y_n$ where $Y_i = d(F_i,{\bar{F}})$ and Q(t) is the cumulative distribution function (CDF) of $Y_i$,

$$\begin{aligned} \sup _{t} |{\widehat{Q}}(t)-Q(t)| = O_P\left( \frac{1}{\sqrt{n}}\right) . \end{aligned}$$

By the mean value theorem and the fact that ${\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma ) = o_P(1)$,

$$\begin{aligned} {\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma )&=Q(q_\gamma ) -Q({\widetilde{q}}_\gamma ) \\&= Q'(q_\gamma ^*) (q_\gamma -{\widetilde{q}}_\gamma ) \end{aligned}$$

for some $q_\gamma ^* \in [q_\gamma , {\widetilde{q}}_\gamma ]$. By assumption, $Q'(q_\gamma ^*)\ge q_0>0$ so

$$\begin{aligned} |q_\gamma -{\widetilde{q}}_\gamma | \le \frac{1}{q_0} \left| {\widehat{Q}}({\widetilde{q}}_\gamma ) -Q({\widetilde{q}}_\gamma )\right| =O_P\left( \frac{1}{\sqrt{n}}\right) . \end{aligned}$$

Now, because $\max {i=1,\ldots ,n}|d(F_i,{\bar{F}})-d(F_i,{\widehat{F}})| \le d({\bar{F}},{\widehat{F}})$, ${\widetilde{q}}_\gamma $, the quantile of $d(F_1,{\bar{F}}),\ldots , d(F_n,{\bar{F}})$, and ${\widehat{q}}_\gamma $, the quantile of $d(F_1,{\widehat{F}}),\ldots , d(F_n,{\widehat{F}})$, are bounded by

$$\begin{aligned} |{\widetilde{q}}_\gamma -{\widehat{q}}_\gamma | \le d({\bar{F}},{\widehat{F}}). \end{aligned}$$

which implies

$$\begin{aligned} |{\widehat{q}}_\gamma - q_\gamma |=O_P\left( \frac{1}{\sqrt{n}}\right) +d({\bar{F}},{\widehat{F}}). \end{aligned}$$

Part 2 Let

$$\begin{aligned} {\widehat{A}}_\gamma = P(F_\mathsf{new}\subset \widehat{{\mathcal {F}}}_\gamma |F_1,\ldots ,F_n) =P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n). \end{aligned}$$

be the probability of interest. By the triangular inequality,

$$\begin{aligned} |d(F_\mathsf{new}, {\bar{F}})-d(F_\mathsf{new}, {\widehat{F}}) | \le d({\bar{F}}, {\widehat{F}}). \end{aligned}$$

Thus,

$$\begin{aligned} A_\gamma&= P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n) \\&\ge P(d(F_\mathsf{new}, {\bar{F}})\le {\widehat{q}}_\gamma -d({\widehat{F}},{\bar{F}})|F_1,\ldots ,F_n)\\&= Q({\widehat{q}}_\gamma -d({\widehat{F}},{\bar{F}}))\\&\ge Q\left( q_\gamma -2d({\widehat{F}},{\bar{F}})+O_P\left( \frac{1}{\sqrt{n}}\right) \right) \\&= Q(q_\gamma ) + O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) .\\ A_\gamma&= P(d(F_\mathsf{new}, {\widehat{F}})\le {\widehat{q}}_\gamma |F_1,\ldots ,F_n) \\&\le P(d(F_\mathsf{new}, {\bar{F}})\le {\widehat{q}}_\gamma +d({\widehat{F}},{\bar{F}})|F_1,\ldots ,F_n)\\&=Q({\widehat{q}}_\gamma +d({\widehat{F}},{\bar{F}}))\\&\le Q\left( q_\gamma +2d({\widehat{F}},{\bar{F}})+O_P\left( \frac{1}{\sqrt{n}}\right) \right) \\&= Q(q_\gamma ) + O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) .\\ \end{aligned}$$

Thus,

$$\begin{aligned} A_\gamma = P(F_\mathsf{new}\subset \widehat{{\mathcal {F}}}_\gamma |F_1,\ldots ,F_n)= \underbrace{Q(q_\gamma )}_{=\gamma }+ O_P\left( \left( \frac{1}{\sqrt{n}}\right) +d({\widehat{F}},{\bar{F}})\right) , \end{aligned}$$

which completes the proof. $\square $

B STIX simulation results

For completeness, results for the STIX simulation study from Sect. 4.2.1 are displayed for $t_2 = 5, 5.25, 5.5, 5.75, 6, 7$. Figures 16 and 17 include the 0th and 1st homology dimensions, respectively, for the landscape and generalized landscape functions for their first 10 function orders; a subset of the results for the 1st homology dimension are displayed in Fig. 12 and discussed in the main text. Note that the results for the setting with $t_2 = 8$ are not displayed because the permutation p values for all function orders for the 1st homology dimension achieve the minimum possible p value, and most of the function orders for the 0th homology dimension achieve this minimum as well.

Similarly, the results for the other functional summaries and the concatenated orders of the landscape and generalized landscape functions for $t_2 = 5, 5.25, 5.5, 5.75, 6, 7$ are displayed in Figs. 18 and 19 for the 0th and 1st homology dimensions, respectively. A subset of the results for the 1st homology dimension are displayed in Fig. 13 and discussed in the main text. Note that the results for the setting with $t_2 = 8$ are not displayed because the permutation p values for all functional summaries for the 1st homology dimension achieve the minimum possible p value, and all functional summaries (except the PDT for the 0th homology dimension) achieve this minimum as well.

C Alternative generalized landscape function kernel results

1.1 C.1 Simulated Gleason data

Figure 20 displays classification test error results for generalized landscape functions using an alternative kernel function (the tricubic kernel) to the one considered in Sect. 4.1 (the triangle kernel). The ratio of the classification test errors are plot with the tricubic kernel in the numerator and the corresponding triangle kernel plotted in the denominator. The ratios vary between about 0.98 and 1.02 suggesting that the kernel choice is not strongly affecting the results in this setting.

1.2 C.2 Pick-up sticks simulation data

Figure 21 displays the two-sample hypothesis test results for generalized landscape functions using an alternative kernel function (the tricubic kernel) for the setting with samples drawn from the null population, ${\mathcal {P}}_{F,1}$, with an average thickness of $t_1 = 5$ and an alternative hypothesis with an average thicknesses of $t_2 = 5.75$. Additionally, a comparison between the median p values using the tricubic kernel compared to the triangle kernal considered in Sect. 4.2.1. There appears to be more variability in these results compared to what we saw in “Appendix C.1”; however, the p values used in this setting tend to be smaller values than the classification test error values so a larger variation may imply only a small magnitude difference. For $H_0$, the largest magnitude difference in p values (across all function orders and thicknesses considered) is 0.089 (in the setting with $t_2 = 5$, function order 3, bandwidth of 0.60), with an interquartile range of 0 to 0.014. For $H_1$, the largest magnitude difference in p values (across all function orders and thicknesses considered) is 0.065 (in the setting with $t_2 = 5$, function order 2, bandwidth of 0.40), with an interquartile range of 0 to 0.006. There does not appear to be a discernible pattern in the variations between the tricubic and triangle kernel function results.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Berry, E., Chen, YC., Cisewski-Kehe, J. et al. Functional summaries of persistence diagrams. J Appl. and Comput. Topology 4, 211–262 (2020). https://doi.org/10.1007/s41468-020-00048-w

Download citation

Received: 27 May 2018
Accepted: 31 January 2020
Published: 31 March 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s41468-020-00048-w

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Multiscale Persistent Functions for Biomolecular Structure Characterization