poster

Scalable kNN search on vertically stored time series

Authors:

Shrikant Kashyap,

Panagiotis KarrasAuthors Info & Claims

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1334 - 1342

https://doi.org/10.1145/2020408.2020607

Published: 21 August 2011 Publication History

Abstract

Nearest-neighbor search over time series has received vast research attention as a basic data mining task. Still, none of the hitherto proposed methods scales well with increasing time-series length. This is due to the fact that all methods provide an one-off pruning capacity only. In particular, traditional methods utilize an index to search in a reduced-dimensionality feature space; however, for high time-series length, search with such an index yields many false hits that need to be eliminated by accessing the full records. An attempt to reduce false hits by indexing more features exacerbates the curse of dimensionality, and vice versa. A recently proposed alternative, iSAX, uses symbolic approximate representations accessed by a simple file-system directory as an index. Still, iSAX also encounters false hits, which are again eliminated by accessing records in full: once a false hit is generated by the index, there is no second chance to prune it; thus, the pruning capacity iSAX provides is also one-off. This paper proposes an alternative approach to time series kNN search, following a nontraditional pruning style. Instead of navigating through candidate records via an index, we access their features, obtained by a multi-resolution transform, in a stepwise sequential-scan manner, one level of resolution at a time, over a vertical representation. Most candidates are progressively eliminated after a few of their terms are accessed, using pre-computed information and an unprecedentedly tight double-bounding scheme, involving not only lower, but also upper distance bounds. Our experimental study with large, high-length time-series data confirms the advantage of our approach over both the current state-of-the-art method, iSAX, and classical index-based methods.

References

[1]

iSAX page. http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html.

[2]

R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, 1993.

Digital Library

[3]

S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. J. ACM, 57(1), 2009.

Digital Library

[4]

I. Assent, R. Krieger, F. Afschari, and T. Seidl. The TStree: efficient time series search and retrieval. In EDBT, 2008.

Digital Library

[5]

N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The RStartree: an efficient and robust access method for points and rectangles. In SIGMOD, 1990.

Digital Library

[6]

S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander. Independent quantization: An index compression technique for high-dimensional data spaces. In ICDE, 2000.

[7]

S. Berchtold, C. Böhm, and H.-P. Kriegal. The pyramid-technique: towards breaking the curse of dimensionality. In SIGMOD, 1998.

Digital Library

[8]

S. Berchtold, D. A. Keim, and H.-P. Kriegel. The Xtree: An index structure for high-dimensional data. In VLDB, 1996.

Digital Library

[9]

S. Blott and R. Weber. What's wrong with high-dimensional similarity search? PVLDB, 1(1):3--3, 2008.

Digital Library

[10]

C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322--373, 2001.

Digital Library

[11]

Y. Cai and R. Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, 2004.

Digital Library

[12]

K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM TODS, 27(2):188--228, 2002.

Digital Library

[13]

K. Chakrabarti and S. Mehrotra. The hybrid tree: An index structure for high dimensional feature spaces. In ICDE, 1999.

[14]

K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, 1999.

[15]

L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, 2005.

Digital Library

[16]

Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu. Indexable PLA for efficient similarity search. In VLDB, 2007.

Digital Library

[17]

P. Ciaccia, M. Patella, and P. Zezula. \mtree: An efficient access method for similarity search in metric spaces. In VLDB, 1997.

Digital Library

[18]

B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. Indexing high-dimensional data for efficient in-memory similarity search. IEEE TKDE, 17(3):339--353, 2005.

Digital Library

[19]

A. P. de~Vries, N. Mamoulis, N. Nes, and M. Kersten. Efficient kNN search on vertically decomposed data. In SIGMOD, 2002.

Digital Library

[20]

H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2), 2008.

Digital Library

[21]

C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, 1994.

Digital Library

[22]

P. Geurts. Pattern extraction for time series classification. In PKDD, 2001.

Digital Library

[23]

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.

Digital Library

[24]

A. Guttman. Rtrees: a dynamic index structure for spatial searching. In SIGMOD, 1984.

Digital Library

[25]

J. M. Hellerstein, E. Koutsoupias, and C. H. Papadimitriou. On the analysis of indexing schemes. In PODS, 1997.

Digital Library

[26]

G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM TODS, 24(2):265--318, 1999.

Digital Library

[27]

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.

Digital Library

[28]

H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+tree-based indexing method for nearest neighbor search. ACM TODS, 30(2):364--397, 2005.

Digital Library

[29]

K. V. R. Kanth, D. Agrawal, and A. Singh. Dimensionality reduction for similarity searching in dynamic databases. In SIGMOD, 1998.

Digital Library

[30]

P. Karras and N. Mamoulis. One-pass wavelet synopses for maximum-error metrics. In VLDB, 2005.

Digital Library

[31]

P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. In ICDE, 2007.

[32]

N. Katayama and S. Satoh. The SRtree: an index structure for high-dimensional nearest neighbor queries. In SIGMOD, 1997.

Digital Library

[33]

E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst., 3(3):263--286, 2001.

[34]

F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In SIGMOD, 1997.

Digital Library

[35]

F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In VLDB, 1996.

Digital Library

[36]

A. Koski, M. Juhola, and M. Meriste. Syntactic recognition of ECG signals by attributed finite automata. Pattern Recognition, 28(12):1927--1940, 1995.

Digital Library

[37]

J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. In SIGMOD, 1999.

Digital Library

[38]

X. Lian, L. Chen, J. X. Yu, J. Han, and J. Ma. Multiscale representations for fast pattern matching in stream time series. IEEE TKDE, 21(4):568--581, 2009.

Digital Library

[39]

J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107--144, 2007.

Digital Library

[40]

K. I. Lin, H. V. Jagadish, and C. Faloutsos. The TVtree: an index structure for high-dimensional data. The VLDB Journal, 3(4):517--542, 1994.

Digital Library

[41]

I. Popivanov and R. J. Miller. Similarity search over time-series data using wavelets. In ICDE, 2002.

[42]

N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995.

Digital Library

[43]

Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima. The Atree: An index structure for high-dimensional spaces using relative approximation. In VLDB, 2000.

Digital Library

[44]

T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In SIGMOD, 1998.

Digital Library

[45]

T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The R+tree: A dynamic index for multi-dimensional objects. In VLDB, 1987.

Digital Library

[46]

J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In KDD, 2008.

Digital Library

[47]

H. Sun, Ö. Öztürk, and H. Ferhatosmanouglu. CoMRI: A compressed multi-resolution index structure for sequence similarity queries. In CSB, 2003.

Digital Library

[48]

Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, 2009.

Digital Library

[49]

R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.

Digital Library

[50]

D. Wu, A. Singh, D. Agrawal, A. El~Abbadi, and T. R. Smith. Efficient retrieval for browsing large image databases. In CIKM, 1996.

Digital Library

[51]

H. Wu, B. Salzberg, and D. Zhang. Online event-driven subsequence matching over financial data streams. In SIGMOD, 2004.

Digital Library

[52]

M.-Y. Yeh, K.-L. Wu, P. S. Yu, and M.-S. Chen. LEEWAVE: Level-wise distribution of wavelet coefficients for processing kNN queries over distributed streams. In VLDB, 2008.

Digital Library

[53]

B.-K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary L_p norms. In VLDB, 2000.

Digital Library

[54]

C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to kNN processing. In VLDB, 2001.

Digital Library

Cited By

Tran MLe XHuynh VYoon S(2024)PISD: A linear complexity distance beats dynamic time warping on time series classification and clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109222138(109222)Online publication date: Dec-2024
https://doi.org/10.1016/j.engappai.2024.109222
Paparrizos JWu KElmore AFaloutsos CFranklin M(2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
https://dl.acm.org/doi/10.14778/3594512.3594530
Mortensen KZardbani FHaque MAgustsson SMottin DHofmann PKarras P(2023)Marigold: Efficient k-Means Clustering in High DimensionsProceedings of the VLDB Endowment10.14778/3587136.358714716:7(1740-1748)Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.14778/3587136.3587147
Show More Cited By

Index Terms

Scalable kNN search on vertically stored time series
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information systems applications
    1. Data mining
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Efficient k-nearest neighbor searching in nonordered discrete data spaces

Numerous techniques have been proposed in the past for supporting efficient k-nearest neighbor (k-NN) queries in continuous data spaces. Limited work has been reported in the literature for k-NN queries in a nonordered discrete data space (NDDS). ...
Index Structures for Fast Similarity Search for Binary Vectors

This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
Index Structures for Fast Similarity Search for Real Vectors. II*

This survey article considers index structures for fast similarity search for objects represented by real-valued vectors. Structures for both exact and faster but approximate similarity search are considered. Index structures based on partitioning into ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2011

1446 pages

ISBN:9781450308137

DOI:10.1145/2020408

General Chair:
Chid Apte
IBM Research
,
Program Chairs:
Joydeep Ghosh
UT Austin
,
Padhraic Smyth
UC Irvine

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Poster

Conference

KDD '11

Sponsor:

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2011

California, San Diego, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

44
Total Citations
View Citations
675
Total Downloads

Downloads (Last 12 months)23
Downloads (Last 6 weeks)7

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Tran MLe XHuynh VYoon S(2024)PISD: A linear complexity distance beats dynamic time warping on time series classification and clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109222138(109222)Online publication date: Dec-2024
https://doi.org/10.1016/j.engappai.2024.109222
Paparrizos JWu KElmore AFaloutsos CFranklin M(2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
https://dl.acm.org/doi/10.14778/3594512.3594530
Mortensen KZardbani FHaque MAgustsson SMottin DHofmann PKarras P(2023)Marigold: Efficient k-Means Clustering in High DimensionsProceedings of the VLDB Endowment10.14778/3587136.358714716:7(1740-1748)Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.14778/3587136.3587147
Chatzigeorgakidis GSkoutas DPatroumpas KPalpanas TAthanasiou SSkiadopoulos S(2022)Efficient Range and kNN Twin Subsequence Search in Time SeriesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3167257(1-1)Online publication date: 2022
https://doi.org/10.1109/TKDE.2022.3167257
Paparrizos JEdian ILiu CElmore AFranklin M(2022)Fast Adaptive Similarity Search through Variance-Aware Quantization2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00268(2969-2983)Online publication date: May-2022
https://doi.org/10.1109/ICDE53745.2022.00268
Zhang LAlghamdi NZhang HEltabakh MRundensteiner E(2022)PARROT: pattern-based correlation exploitation in big partitioned data seriesThe VLDB Journal10.1007/s00778-022-00767-932:3(665-688)Online publication date: 13-Oct-2022
https://doi.org/10.1007/s00778-022-00767-9
Echihabi KZoumpatianos KPalpanas T(2021)High-Dimensional Similarity Search for Scalable Data Science2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00268(2369-2372)Online publication date: Apr-2021
https://doi.org/10.1109/ICDE51399.2021.00268
Peng BFatourou PPalpanas T(2021)Fast data series indexing for in-memory dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-021-00677-230:6(1041-1067)Online publication date: 18-Jun-2021
https://dl.acm.org/doi/10.1007/s00778-021-00677-2
Bernstein PMadhavan JRahm E(2020)Generic schema matching, ten years laterProceedings of the VLDB Endowment10.14778/3402707.34027104:11(695-701)Online publication date: 3-Jun-2020
https://dl.acm.org/doi/10.14778/3402707.3402710
Fan WLiu MTian CXu RZhou J(2020)Incrementalization of graph partitioning algorithmsProceedings of the VLDB Endowment10.14778/3389133.338914213:8(1261-1274)Online publication date: 3-May-2020
https://dl.acm.org/doi/10.14778/3389133.3389142
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents