Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2020408.2020607acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
poster

Scalable kNN search on vertically stored time series

Published: 21 August 2011 Publication History

Abstract

Nearest-neighbor search over time series has received vast research attention as a basic data mining task. Still, none of the hitherto proposed methods scales well with increasing time-series length. This is due to the fact that all methods provide an one-off pruning capacity only. In particular, traditional methods utilize an index to search in a reduced-dimensionality feature space; however, for high time-series length, search with such an index yields many false hits that need to be eliminated by accessing the full records. An attempt to reduce false hits by indexing more features exacerbates the curse of dimensionality, and vice versa. A recently proposed alternative, iSAX, uses symbolic approximate representations accessed by a simple file-system directory as an index. Still, iSAX also encounters false hits, which are again eliminated by accessing records in full: once a false hit is generated by the index, there is no second chance to prune it; thus, the pruning capacity iSAX provides is also one-off. This paper proposes an alternative approach to time series kNN search, following a nontraditional pruning style. Instead of navigating through candidate records via an index, we access their features, obtained by a multi-resolution transform, in a stepwise sequential-scan manner, one level of resolution at a time, over a vertical representation. Most candidates are progressively eliminated after a few of their terms are accessed, using pre-computed information and an unprecedentedly tight double-bounding scheme, involving not only lower, but also upper distance bounds. Our experimental study with large, high-length time-series data confirms the advantage of our approach over both the current state-of-the-art method, iSAX, and classical index-based methods.

References

[1]
iSAX page. http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html.
[2]
R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, 1993.
[3]
S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. J. ACM, 57(1), 2009.
[4]
I. Assent, R. Krieger, F. Afschari, and T. Seidl. The TStree: efficient time series search and retrieval. In EDBT, 2008.
[5]
N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The RStartree: an efficient and robust access method for points and rectangles. In SIGMOD, 1990.
[6]
S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander. Independent quantization: An index compression technique for high-dimensional data spaces. In ICDE, 2000.
[7]
S. Berchtold, C. Böhm, and H.-P. Kriegal. The pyramid-technique: towards breaking the curse of dimensionality. In SIGMOD, 1998.
[8]
S. Berchtold, D. A. Keim, and H.-P. Kriegel. The Xtree: An index structure for high-dimensional data. In VLDB, 1996.
[9]
S. Blott and R. Weber. What's wrong with high-dimensional similarity search? PVLDB, 1(1):3--3, 2008.
[10]
C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322--373, 2001.
[11]
Y. Cai and R. Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, 2004.
[12]
K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM TODS, 27(2):188--228, 2002.
[13]
K. Chakrabarti and S. Mehrotra. The hybrid tree: An index structure for high dimensional feature spaces. In ICDE, 1999.
[14]
K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, 1999.
[15]
L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, 2005.
[16]
Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu. Indexable PLA for efficient similarity search. In VLDB, 2007.
[17]
P. Ciaccia, M. Patella, and P. Zezula. \mtree: An efficient access method for similarity search in metric spaces. In VLDB, 1997.
[18]
B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. Indexing high-dimensional data for efficient in-memory similarity search. IEEE TKDE, 17(3):339--353, 2005.
[19]
A. P. de~Vries, N. Mamoulis, N. Nes, and M. Kersten. Efficient kNN search on vertically decomposed data. In SIGMOD, 2002.
[20]
H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2), 2008.
[21]
C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, 1994.
[22]
P. Geurts. Pattern extraction for time series classification. In PKDD, 2001.
[23]
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[24]
A. Guttman. Rtrees: a dynamic index structure for spatial searching. In SIGMOD, 1984.
[25]
J. M. Hellerstein, E. Koutsoupias, and C. H. Papadimitriou. On the analysis of indexing schemes. In PODS, 1997.
[26]
G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM TODS, 24(2):265--318, 1999.
[27]
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[28]
H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+tree-based indexing method for nearest neighbor search. ACM TODS, 30(2):364--397, 2005.
[29]
K. V. R. Kanth, D. Agrawal, and A. Singh. Dimensionality reduction for similarity searching in dynamic databases. In SIGMOD, 1998.
[30]
P. Karras and N. Mamoulis. One-pass wavelet synopses for maximum-error metrics. In VLDB, 2005.
[31]
P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. In ICDE, 2007.
[32]
N. Katayama and S. Satoh. The SRtree: an index structure for high-dimensional nearest neighbor queries. In SIGMOD, 1997.
[33]
E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst., 3(3):263--286, 2001.
[34]
F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In SIGMOD, 1997.
[35]
F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In VLDB, 1996.
[36]
A. Koski, M. Juhola, and M. Meriste. Syntactic recognition of ECG signals by attributed finite automata. Pattern Recognition, 28(12):1927--1940, 1995.
[37]
J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. In SIGMOD, 1999.
[38]
X. Lian, L. Chen, J. X. Yu, J. Han, and J. Ma. Multiscale representations for fast pattern matching in stream time series. IEEE TKDE, 21(4):568--581, 2009.
[39]
J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107--144, 2007.
[40]
K. I. Lin, H. V. Jagadish, and C. Faloutsos. The TVtree: an index structure for high-dimensional data. The VLDB Journal, 3(4):517--542, 1994.
[41]
I. Popivanov and R. J. Miller. Similarity search over time-series data using wavelets. In ICDE, 2002.
[42]
N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995.
[43]
Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima. The Atree: An index structure for high-dimensional spaces using relative approximation. In VLDB, 2000.
[44]
T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In SIGMOD, 1998.
[45]
T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The R+tree: A dynamic index for multi-dimensional objects. In VLDB, 1987.
[46]
J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In KDD, 2008.
[47]
H. Sun, Ö. Öztürk, and H. Ferhatosmanouglu. CoMRI: A compressed multi-resolution index structure for sequence similarity queries. In CSB, 2003.
[48]
Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, 2009.
[49]
R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.
[50]
D. Wu, A. Singh, D. Agrawal, A. El~Abbadi, and T. R. Smith. Efficient retrieval for browsing large image databases. In CIKM, 1996.
[51]
H. Wu, B. Salzberg, and D. Zhang. Online event-driven subsequence matching over financial data streams. In SIGMOD, 2004.
[52]
M.-Y. Yeh, K.-L. Wu, P. S. Yu, and M.-S. Chen. LEEWAVE: Level-wise distribution of wavelet coefficients for processing kNN queries over distributed streams. In VLDB, 2008.
[53]
B.-K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In VLDB, 2000.
[54]
C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to kNN processing. In VLDB, 2001.

Cited By

View all
  • (2024)PISD: A linear complexity distance beats dynamic time warping on time series classification and clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109222138(109222)Online publication date: Dec-2024
  • (2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
  • (2023)Marigold: Efficient k-Means Clustering in High DimensionsProceedings of the VLDB Endowment10.14778/3587136.358714716:7(1740-1748)Online publication date: 1-Mar-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN:9781450308137
DOI:10.1145/2020408
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2011

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. nearest neighbor
  2. similarity search
  3. time series

Qualifiers

  • Poster

Conference

KDD '11
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)23
  • Downloads (Last 6 weeks)7
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)PISD: A linear complexity distance beats dynamic time warping on time series classification and clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109222138(109222)Online publication date: Dec-2024
  • (2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
  • (2023)Marigold: Efficient k-Means Clustering in High DimensionsProceedings of the VLDB Endowment10.14778/3587136.358714716:7(1740-1748)Online publication date: 1-Mar-2023
  • (2022)Efficient Range and kNN Twin Subsequence Search in Time SeriesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3167257(1-1)Online publication date: 2022
  • (2022)Fast Adaptive Similarity Search through Variance-Aware Quantization2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00268(2969-2983)Online publication date: May-2022
  • (2022)PARROT: pattern-based correlation exploitation in big partitioned data seriesThe VLDB Journal10.1007/s00778-022-00767-932:3(665-688)Online publication date: 13-Oct-2022
  • (2021)High-Dimensional Similarity Search for Scalable Data Science2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00268(2369-2372)Online publication date: Apr-2021
  • (2021)Fast data series indexing for in-memory dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-021-00677-230:6(1041-1067)Online publication date: 18-Jun-2021
  • (2020)Generic schema matching, ten years laterProceedings of the VLDB Endowment10.14778/3402707.34027104:11(695-701)Online publication date: 3-Jun-2020
  • (2020)Incrementalization of graph partitioning algorithmsProceedings of the VLDB Endowment10.14778/3389133.338914213:8(1261-1274)Online publication date: 3-May-2020
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media