article

Significant motifs in time series

Authors:

Nuno C. Castro,

Paulo J. AzevedoAuthors Info & Claims

Statistical Analysis and Data Mining, Volume 5, Issue 1

Pages 35 - 53

Published: 01 February 2012 Publication History

Abstract

Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem within applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs. They are typically evaluated by human experts. This is unfeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the data mining communities to evaluate extracted patterns. In this work we present an approach to calculate time series motifs statistical significance. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov Chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution gives means to the application of a powerful technique-statistical tests-to a time series setting. This provides researchers and practitioners with an important tool to evaluate automatically the degree of relevance of each extracted motif. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 5: 35-53, 2012

References

[1]

<label>1</label> P.Ferreira, P.Azevedo, C.Silva, and R.Brito, Mining approximate motifs in time series, in Discovery Science, Secaucus, New Jersey, Springer, 2006, pp.89-101.

Digital Library

[2]

<label>2</label> J.Lin, E.Keogh, S.Lonardi, and P.Patel, Finding motifs in time series, In Proc. of the 2nd Workshop on Temporal Data Mining, Citeseer, 2002, pp.53-68.

[3]

<label>3</label> B.Chiu, E.Keogh, and S.Lonardi, Probabilistic discovery of time series motifs, In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp.498.

Digital Library

[4]

<label>4</label> Y.Tanaka, K.Iwamoto, and K.Uehara, Discovery of time-series motif from multi-dimensional data based on mdl principle, Mach Learn Volume 58 2 2005, pp.269-300.

Digital Library

[5]

<label>5</label> T.Oates, PERUSE: an unsupervised algorithm for finding recurring patterns in time series, IEEE ICDM Volume 2 2002, pp.5.

[6]

<label>6</label> D.Yankov, E.Keogh, J.Medina, B.Chiu, and V.Zordan, Detecting time series motifs under uniform scaling, In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp.844-853.

Digital Library

[7]

<label>7</label> D.Minnen, T.Starner, I.Essa, and C.Isbell, Improving activity discovery with automatic neighborhood estimation, Proceedings of the 20th international joint conference on Artifical intelligence, San Francisco, California, Morgan Kaufmann Publishers Inc., 2007, pp.2814-2819.

Digital Library

[8]

<label>8</label> F.Mörchen and A.Ultsch, Efficient mining of understandable patterns from multivariate interval time series, Data Min Knowl Discov Volume 15 2 2007, pp.181-215.

Digital Library

[9]

<label>9</label> A.Mueen, E.Keogh, Q.Zhu, S.Cash, and B.Westover, Exact discovery of time series motifs, In Proceedings of the Ninth SIAM International Conference on Data Mining SDM, 2009, pp.473-484.

[10]

<label>10</label> A.Mueen and E.Keogh, Online discovery and maintenance of time series motifs, In Proceedings of the sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp.1089-1098.

[11]

<label>11</label> A.Mueen, E.Keogh, and N.Bigdely-Shamlo, Finding time series motifs in disk-resident data, In 2009 Ninth IEEE International Conference on Data Mining, 2009, pp.367-376.

Digital Library

[12]

<label>12</label> N.Castro and P.Azevedo, Multiresolution motif discovery in time series, In Proceedings of the Tenth SIAM International Conference on Data Mining, 2010, pp.665-676.

[13]

<label>13</label> E.Keogh and T.Folias, The UCR Time Series Data Mining Archive, Riverside CA, University of California-Computer Science & Engineering Department, 2002.

[14]

<label>14</label> S.Robin, S.Schbath, and V.Vandewalle, Statistical tests to compare motif count exceptionalities, BMC Bioinformatics Volume 8 1 2007, pp.84.

[15]

<label>15</label> R.Milo, S.Shen-Orr, S.Itzkovitz, N.Kashtan, D.Chklovskii, and U.Alon, Network motifs: simple building blocks of complex networks, Science Volume 298 5594 2002, pp.824.

[16]

<label>16</label> G.Webb, Discovering significant patterns, Mach Learn Volume 68 1 2007, pp.1-33.

Digital Library

[17]

<label>17</label> P. G.Ferreira and P. J.Azevedo, Evaluating protein motif significance measures: a case study on prosite patterns, In IEEE Symposium on Computational Intelligence and Data Mining CIDM 2007, 2007, pp.171-178.

[18]

<label>18</label> J.Zhang, B.Jiang, M.Li, J.Tromp, X.Zhang, and M.Zhang, Computing exact P-values for DNA motifs, Bioinformatics Volume 23 5 2007, pp.531.

Digital Library

[19]

<label>19</label> T.Marschall and S.Rahmann, Efficient exact motif discovery, Bioinformatics Volume 25 12 2009, pp.i356.

Digital Library

[20]

<label>20</label> G.Nuel, Effective p-value computations using Finite Markov Chain Imbedding FMCI: application to local score and to pattern statistics, Algorithms Mol Biol Volume 1 1 2006, pp.5.

[21]

<label>21</label> V.Boeva, J.Clément, M.Régnier, M.Roytberg, and V.Makeev, Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules, Algorithms Mol Biol Volume 2 1 2007, pp.13.

[22]

<label>22</label> C.Low Kam, A.Mas, and M.Teisseire, Mining for unexpected sequential patterns given a Markov model, 2008. http://www.math.univ-montp2.fr/~mas/lmt_siam09.pdf.

[23]

<label>23</label> J.Hollunder, M.Friedel, A.Beyer, C.Workman, and T.Wilhelm, DASS: efficient discovery and p-value calculation of substructures in unordered data, Bioinformatics Volume 23 1 2007, pp.77.

Digital Library

[24]

<label>24</label> S.Robin and S.Schbath, Numerical comparison of several approximations of the word count distribution in random sequences, J Comput Biol Volume 8 4 2001, pp.349-359.

[25]

<label>25</label> M.Régnier and M.Vandenbogaert, Comparison of statistical significance criteria, J Bioinformatics Comput Biol Volume 4 2 2006, pp.537-552.

[26]

<label>26</label> S.Schbath, An overview on the distribution of word counts in Markov chains, J Comput Biol Volume 7 1-2 2000, pp.193-201.

[27]

<label>27</label> H.He and A.Singh, Graphrank: statistical modeling and mining of significant subgraphs in the feature space, In Sixth International Conference on Data Mining ICDM'06, 2006, pp.885-890.

Digital Library

[28]

<label>28</label> S.Jacquemont, F.Jacquenet, and M.Sebban, Mining probabilistic automata: a statistical view of sequential pattern mining, Mach Learn Volume 75 1 2009, pp.91-127.

Digital Library

[29]

<label>29</label> P.Ribeca and E.Raineri, Faster exact Markovian probability functions for motif occurrences: a DFA-only approach, Bioinformatics Volume 24 24 2008, pp.2839.

Digital Library

[30]

<label>30</label> C.Matias, S.Schbath, E.Birmelé, J.Daudin, and S.Robin, Network motifs: mean and variance for the count, REVSTAT Stat J Volume 4 1 2006, pp.31-51.

[31]

<label>31</label> F.Picard, J.Daudin, M.Koskas, S.Schbath, and S.Robin, Assessing the exceptionality of network motifs, J Comput Biol Volume 15 1 2008, pp.1-20.

[32]

<label>32</label> E.Keogh, S.Lonardi, and B.Chiu, Finding surprising patterns in a time series database in linear time and space, In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp.550-556.

Digital Library

[33]

<label>33</label> E.Keogh and S.Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min Knowl Discov Volume 7 4 2003, pp.349-371.

[34]

<label>34</label> H.Ding, G.Trajcevski, P.Scheuermann, X.Wang, and E.Keogh, Querying and mining of time series data: experimental comparison of representations and distance measures, Proc VLDB Endowment Volume 1 2 2008, pp.1542-1552.

Digital Library

[35]

<label>35</label> J.Shieh and E.Keogh, iSAX: indexing and mining terabyte sized time series, In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp.623-631.

[36]

<label>36</label> S.Schbath, Statistics of motifs, Atelier de Form Volume 1502 2006.

[37]

<label>37</label> S.Robin, F.Rodolphe, and S.Schbath, DNA, Words and Models, New York, Cambridge Univ. Press, 2005.

[38]

<label>38</label> S.Holm, A simple sequentially rejective multiple test procedure, Scand J Stat Volume 6 2 1979, pp.65-70.

[39]

<label>39</label> S.Hanhijärvi, K.Puolamäki, and G.Garriga, Multiple hypothesis testing in pattern discovery, STAT Volume 1050 2009, pp.29.

[40]

<label>40</label> J.Storey and R.Tibshirani, Statistical significance for genomewide studies, Proc Natl Acad Sci USA Volume 100 16 2003, pp.9440.

[41]

<label>41</label> S.Hanhijärvi, K.Puolamäki, and G.Garriga, Multiple hypothesis testing in pattern discovery, Arxiv preprint arXiv:0906.5263, 2009.

[42]

<label>42</label> Y.Benjamini and Y.Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J R Stat Soc B Volume 57 1 1995, pp.289-300.

[43]

<label>43</label> Y.Benjamini and M.Leshno, Statistical methods for data mining, Data Mining and Knowledge Discovery Handbook, Tel-Aviv, Israel, Springer, 2005, pp.565-587.

[44]

<label>44</label> H.Zhang, B.Padmanabhan, and A.Tuzhilin, On the discovery of significant statistical quantitative rules, In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp.374-383.

Digital Library

[45]

<label>45</label> S.Santosh Bangalore, J.Wang, and D.Allison, How accurate are the extremely small p-values used in genomic research: an evaluation of numerical libraries, Comput Stat Data Anal Volume 53 7 2009, pp.2446-2452.

Digital Library

[46]

<label>46</label> E.Keogh, J.Lin, and A.Fu, HOT SAX: efficiently finding the most unusual time series subsequence, In Proceedings of the Fifth IEEE International Conference on Data Mining, IEEE Computer Society, 2005, pp.233.

Digital Library

[47]

<label>47</label> N.Castro, Time series motifs statistical significance website. "http://www.di.uminho.pt/ castro/stat".

[48]

<label>48</label> Y.Gao, J.He, J.Zou, R.Zeng, and X.Liang, Fractal simulation of soil breakdown under lightning current, J Electrostat Volume 61 3-4 2004, pp.197-207.

Cited By

Borges HDutra MBazaz ACoutinho RPerosi FPorto FMasseglia FPacitti EOgasawara E(2020)Spatial-time motifs discoveryIntelligent Data Analysis10.3233/IDA-19475924:5(1121-1140)Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.3233/IDA-194759
Petitjean FLi TTatti NWebb G(2016)SkopusData Mining and Knowledge Discovery10.1007/s10618-016-0467-930:5(1086-1111)Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s10618-016-0467-9
Jiao ZZou LCao YQian NMa Z(2014)Effective connectivity analysis of fMRI data based on network motifsThe Journal of Supercomputing10.1007/s11227-013-1010-z67:3(806-819)Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1007/s11227-013-1010-z

Significant motifs in time series
1. Information systems
  1. Information systems applications
2. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms

Recommendations

Detection of over-represented motifs corresponding to known TFBSs via motif clustering and matching

Detection of over-represented motifs corresponding to known TFBSs (Transcription Factor Binding Sites) is an important problem in biological sequences analysis. In this paper, a novel motif discovery method based on motif clustering and matching is ...
Analysis of Cis-Regulatory Motifs in Cassette Exons by Incorporating Exon Skipping Rates
ISBRA '09: Proceedings of the 5th International Symposium on Bioinformatics Research and Applications

Identification of cis-regulatory motifs has long been a hotspot in the study of alternative splicing. We propose a two-step approach: we first identify k-mer seed motifs by testing for enrichment and significant differences in exon skipping rate, then a ...
A survey on time series motif discovery

Time series motifs are repeated subsequences in a long time series. Discovering time series motifs is an important task in time series data mining and this problem has received significant attention from researchers in data mining communities. In this ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Statistical Analysis and Data Mining

Statistical Analysis and Data Mining Volume 5, Issue 1

February 2012

91 pages

ISSN:1932-1864

EISSN:1932-1872

Issue’s Table of Contents

Publisher

John Wiley & Sons, Inc.

United States

Publication History

Published: 01 February 2012

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 19 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Borges HDutra MBazaz ACoutinho RPerosi FPorto FMasseglia FPacitti EOgasawara E(2020)Spatial-time motifs discoveryIntelligent Data Analysis10.3233/IDA-19475924:5(1121-1140)Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.3233/IDA-194759
Petitjean FLi TTatti NWebb G(2016)SkopusData Mining and Knowledge Discovery10.1007/s10618-016-0467-930:5(1086-1111)Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s10618-016-0467-9
Jiao ZZou LCao YQian NMa Z(2014)Effective connectivity analysis of fMRI data based on network motifsThe Journal of Supercomputing10.1007/s11227-013-1010-z67:3(806-819)Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1007/s11227-013-1010-z

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents