Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/882082.882086acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

A symbolic representation of time series, with implications for streaming algorithms

Published: 13 June 2003 Publication History

Abstract

The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued.Many researchers have also considered transforming real valued time series into symbolic representations, nothing that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms.In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead.We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.

References

[1]
Agrawal, R., Psaila, G., Wimmers, E. L. amp; Zait, M. (1995). Querying Shapes of Histories. In proceedings of the 21st Int'l Conference on Very Large Databases. Zurich, Switzerland, Sept 11--15. pp 502--514.]]
[2]
André-Jönsson, H. amp; Badal. D. (1997). Using Signature Files for Querying Time-Series Data. In proceedings of Principles of Data Mining and Knowledge Discovery. 1st European Symposium. Trondheim, Norway, Jun 24--27. pp 211--220.]]
[3]
Apostolico, A., Block, M. E. amp; Lonardi, S. (2002). Monotony of Surprise and Large-Scale Quest for Unusual Words. In proceedings of the 6th Int'l Conference on Research in Computational Molecular Biology. Washington, DC, April 18--21. pp 22--31.]]
[4]
Babcock, B, Babu, S., Datar, M., Motwani, R. amp; Widom, J. (2002). Models and Issues in Data Stream Systems. Invited Paper in proceedings of the 2002 ACM Symp. On Principles of Database Systems. June 3--5, Madison, WI.]]
[5]
Bastogne, T., Noura, H., Richard A. amp; Hittinger, J. M. (2002). Application of Subspace Methods to the Identification of a Winding Process. In proceedings of the 4th European Control Conference, Vol. 5, Brussels.]]
[6]
Berndt, D. amp; Clifford, J. (1994) Using Dynamic Time Warping to Find Patterns in Time Series. In proceedings of the Workshop on Knowledge Discovery in Databases, at the 12th Int'l Conference on Artificial Intelligence. July 31-Aug 4, Seattle, WA. pp 229--248.]]
[7]
Chan, K. amp; Fu, A. W. (1999). Efficient Time Series Matching by Wavelets. In proceedings of the 15th IEEE Int'l Conference on Data Engineering. Sydney, Australia, Mar 23--26. pp 126--133.]]
[8]
Cortes, C., Fisher, K., Pregibon, D., Rogers, A. amp; Smith, F. (2000). Hancock: a Language for Extracting Signatures from Data Streams. In proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining. Aug 20--23, Boston, MA. pp 9--17.]]
[9]
Dasgupta, D. amp; Forrest, S. (1996) Novelty Detection in Time Series Data using Ideas from Immunology. In proceedings of The International Conference on Intelligent Systems. June 19--21.]]
[10]
Datar, M. amp; Muthukrishnan, S. (2002). Estimating Rarity and Similarity over Data Stream Windows. In proceedings of the 10th European Symposium on Algorithms. Sep 17--21, Rome, Italy.]]
[11]
Daw, C. S., Finney, C. E. A. amp; Tracy, E. R. (2001). Symbolic Analysis of Experimental Data. Review of Scientific Instruments. (2002-07-22).]]
[12]
Ding, C., He, X., Zha, amp; Simon., H. (2002). Adaptive Dimension Reduction for Clustering High Dimensional Data. In proceedings of the 2nd IEEE International Conference on Data Mining. Dec 9--12. Maebashi, Japan. pp 147--154.]]
[13]
Durbin, R., Eddy, S., Krogh, A. amp; Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.]]
[14]
Faloutsos, C., Ranganathan, M., amp; Manolopoulos, Y. (1994). Fast Subsequence Matching in Time-Series Databases. In proceedings of the ACM SIGMOD Int'l Conference on Management of Data. May 24--27, Minneapolis, MN. pp 419--429.]]
[15]
Fayyad, U., Reina, C. &. Bradley, P. (1998). Initialization of Iterative Refinement Clustering Algorithms. In proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 194--198.]]
[16]
Geurts, P. (2001). Pattern Extraction for Time Series Classification. In proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery. Sep 3--7, Freiburg, Germany. pp. 115--127.]]
[17]
Gionis, A. amp; Mannila, H. (2003). Finding Recurrent Sources in Sequences. In proceedings of the 7th International Conference on Research in Computational Molecular Biology. Apr 10--13, Berlin, Germany. To Appear.]]
[18]
Guha, S., Mishra, N., Motwani, R. amp; O'Callaghan, L. (2000). Clustering Data Streams. In proceedings of the 41st Symposium on Foundations of Computer Science. Nov 12--14, Redondo Beach, CA. pp 359--366.]]
[19]
Hellerstein, J. M., Papadimitriou, C. H. amp; Koutsoupias, E. (1997). Towards an Analysis of Indexing Schemes. In proceedings of the 16th ACM Symposium on Principles of Database Systems. May 12--14, Tueson, AZ. pp 249--256.]]
[20]
Huang, Y. amp; Yu, P. S. (1999). Adaptive Query Processing for Time-Series Data. In proceedings of the 5th Int'l Conference on Knowledge Discovery and Data Mining. San Diego, CA, Aug 15--18. pp 282--286.]]
[21]
Kalpakis, K., Gada, D. amp; Puttagunta, V. (2001). Distance Measures for Effective Clustering of ARIMA Time-Series. In proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, Nov 29-Dec 2. pp 273--280.]]
[22]
Keogh, E., Chakrabarti, K., Pazzani, M. amp; Mehrotra, S. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In proceedings of ACM SIGMOD Conference on Management of Data. Santa Barbara, CA, May 21--24. pp 151--162.]]
[23]
Keogh, E. amp; Kasetty, S. (2002). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 -- 26, 2002. Edmonton, Alberta, Canada. pp 102--111.]]
[24]
Keogh, E., Lonardi, S. amp; Chiu. W. (2002). Finding Surprising Patterns in a Time Series Database in Linear Time and Space. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 -- 26, 2002. Edmonton, Alberta, Canada. pp 550--556.]]
[25]
Keogh, E. amp; Pazzani, M. (1998). An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. pp 239--241.]]
[26]
Lin, J., Keogh, E., Lonardi, S. amp; Patel, P. (2002). Finding Motifs in Time Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 23--26. pp. 53--68.]]
[27]
Larsen, R. J. amp; Marx, M. L. (1986). An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, Englewood, Cliffs, N.J. 2nd Edition.]]
[28]
Lonardi, S. (2001). Global Detectors of Unusual Words: Design, Implementation, and Application to Pattern Discovery in Biosequences. PhD thesis, Department of Computer Sciences, Purdue University, August, 2001.]]
[29]
Reinert, G., Schbath, S. amp; Waterman, M. S. (2000). Probabilistic and Statistical Properties of Words: An Overview. Journal of Computational. Biology. Vol. 7, pp 1--46.]]
[30]
Roddick, J. F., Hornsby, K. amp; Spiliopoulou, M. (2001). An Updated Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In Post-Workshop Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining. Berlin, Springer. Lecture Notes in Artificial Intelligence. Roddick, J. F. and Hornsby, K., Eds. 147--163.]]
[31]
Shahabi, C., Tian, X. amp; Zhao, W. (2000). TSA-tree: A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Surprise and Trend Queries. In proceedings of the 12th Int'l Conference on Scientific and Statistical Database Management. pp 55--68.]]
[32]
Staden, R. (1989). Methods for Discovering Novel Motifs in Nucleic Acid Sequences. Computer Applications in Biosciences. Vol. 5(5). pp 293--298.]]
[33]
Tompa, M. amp; Buhler, J. (2001). Finding Motifs Using Random Projections. In proceedings of the 5th Int'l Conference on Computational Molecular Biology. Montreal, Canada, Apr 22--25. pp 67--74.]]
[34]
Vlachos, M., Kollios, G. amp; Gunopulos, G. (2002). Discovering Similar Multidimensional Trajectories. In proceedings of the 18th International Conference on Data Engineering. Feb 26-Mar 1, San Jose, CA.]]
[35]
Yi, B, K., amp; Faloutsos, C. (2000). Fast Time Sequence Indexing for Arbitrary Lp Norms. In proceedings of the 26st Int'l Conference on Very Large Databases. Sep 10--14, Cairo, Egypt. pp 385--394.]]

Cited By

View all
  • (2025)Baseline-free damage detection and localization on complex composite structures using unsupervised shapelets and shift-invariant dictionary learningMechanical Systems and Signal Processing10.1016/j.ymssp.2024.112035224(112035)Online publication date: Jan-2025
  • (2024)A Semantic Hybrid Temporal Approach for Detecting Driver Mental FatigueSafety10.3390/safety1001000910:1(9)Online publication date: 9-Jan-2024
  • (2024)Unfixed Seasonal Partition Based on Symbolic Aggregate Approximation for Forecasting Solar Power Generation Using Deep LearningElectronics10.3390/electronics1319387113:19(3871)Online publication date: 30-Sep-2024
  • Show More Cited By
  1. A symbolic representation of time series, with implications for streaming algorithms

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
      June 2003
      103 pages
      ISBN:9781450374224
      DOI:10.1145/882082
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 June 2003

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. data mining
      2. data streams
      3. discretize
      4. symbolic
      5. time series

      Qualifiers

      • Article

      Conference

      DMKD03
      Sponsor:

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)332
      • Downloads (Last 6 weeks)51
      Reflects downloads up to 13 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2025)Baseline-free damage detection and localization on complex composite structures using unsupervised shapelets and shift-invariant dictionary learningMechanical Systems and Signal Processing10.1016/j.ymssp.2024.112035224(112035)Online publication date: Jan-2025
      • (2024)A Semantic Hybrid Temporal Approach for Detecting Driver Mental FatigueSafety10.3390/safety1001000910:1(9)Online publication date: 9-Jan-2024
      • (2024)Unfixed Seasonal Partition Based on Symbolic Aggregate Approximation for Forecasting Solar Power Generation Using Deep LearningElectronics10.3390/electronics1319387113:19(3871)Online publication date: 30-Sep-2024
      • (2024)Real-Time Salt Contamination Monitoring System and Method for Transmission Line Insulator Based on Artificial IntelligenceApplied Sciences10.3390/app1404150614:4(1506)Online publication date: 13-Feb-2024
      • (2024)Driving behavior characterization and traffic emission analysis considering the vehicle trajectoryFrontiers in Psychology10.3389/fpsyg.2023.134161114Online publication date: 29-Jan-2024
      • (2024)fABBA: A Python library for the fast symbolic approximation of time seriesJournal of Open Source Software10.21105/joss.062949:95(6294)Online publication date: Mar-2024
      • (2024)Composer Classification Using Maximum Probability Partitioning Based on Compression Principles圧縮原理に基づく最大確率分割情報量を用いた作曲者分類Transactions of the Japanese Society for Artificial Intelligence10.1527/tjsai.39-2_F-NA139:2(F-NA1_1-10)Online publication date: 1-Mar-2024
      • (2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 1-May-2024
      • (2024)DIDS: Double Indices and Double Summarizations for Fast Similarity SearchProceedings of the VLDB Endowment10.14778/3665844.366585117:9(2198-2211)Online publication date: 1-May-2024
      • (2024)CIVET: Exploring Compact Index for Variable-Length Subsequence Matching on Time SeriesProceedings of the VLDB Endowment10.14778/3665844.366584517:9(2123-2135)Online publication date: 1-May-2024
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media