research-article

Indexing for interactive exploration of big data series

Authors:

Kostas Zoumpatianos,

Stratos Idreos,

Themis PalpanasAuthors Info & Claims

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1555 - 1566

https://doi.org/10.1145/2588555.2610498

Published: 18 June 2014 Publication History

Abstract

Numerous applications continuously produce big amounts of data series, and in several time critical scenarios analysts need to be able to query these data as soon as they become available, which is not currently possible with the state-of-the-art indexing methods and for very large data series collections. In this paper, we present the first adaptive indexing mechanism, specifically tailored to solve the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The net effect is that instead of waiting for extended periods of time for the index creation, users can immediately start exploring the data series. We present a detailed design and evaluation of adaptive data series indexing over both synthetic data and real-world workloads. The results show that our approach can gracefully handle large data series collections, while drastically reducing the data to query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), adaptive data series indexing has already answered $3*10^5$ queries.

References

[1]

R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, pages 69--84, 1993.

Digital Library

[2]

I. Assent, R. Krieger, F. Afschari, and T. Seidl. The TS-tree: efficient time series search and retrieval. In EDBT, pages 252--263, 2008.

Digital Library

[3]

J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 18(9):509--517, Sept. 1975.

Digital Library

[4]

S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, pages 28--39, 1996.

Digital Library

[5]

Y. Bu, T. wing Leung, A. W. chee Fu, E. Keogh, J. Pei, and S. Meshkin. Wat: Finding top-k discords in time series database. In SDM, pages 449--454, 2007.

[6]

A. Camerra, T. Palpanas, J. Shieh, and E. Keogh. iSAX 2.0: Indexing and mining one billion time series. In ICDM, pages 58--67, 2010.

Digital Library

[7]

A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. Keogh. Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS, 39(1):123--151, 2014.

Digital Library

[8]

K.-P. Chan and A.-C. Fu. Efficient time series matching by wavelets. In ICDE, pages 126--133, 1999.

Digital Library

[9]

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: a survey. ACM Computing Surveys, 41(3):1--58, 2009.

Digital Library

[10]

C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419--429, 1994.

Digital Library

[11]

G. Graefe, F. Halim, S. Idreos, H. A. Kuno, and S. Manegold. Concurrency control for adaptive indexing. PVLDB, 5(7):656--667, 2012.

Digital Library

[12]

G. Graefe, F. Halim, S. Idreos, H. A. Kuno, S. Manegold, and B. Seeger. Transactional support for adaptive indexing. VLDB J., 23(2):303--328, 2014.

[13]

A. Guttman. R-Trees A Dynamic Structure for Spatial Searching. In SIGMOD, pages 47--57, 1984.

Digital Library

[14]

F. Halim, S. Idreos, P. Karras, and R. H. C. Yap. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. PVLDB, 5(6):502--513, 2012.

Digital Library

[15]

S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki. Here are my Data Files. Here are my Queries. Where are my Results? In CIDR, pages 57--68, 2011.

[16]

S. Idreos, M. L. Kersten, and S. Manegold. Updating a Cracked Database. In SIGMOD, pages 413--424, 2007.

Digital Library

[17]

S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking. In CIDR, pages 68--78, 2007.

[18]

S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing Tuple Reconstruction in Column-stores. In SIGMOD, pages 297--308, 2009.

Digital Library

[19]

S. Idreos and E. Liarou. dbtouch: Analytics at your fingertips. In CIDR, 2013.

[20]

S. Idreos, S. Manegold, H. A. Kuno, and G. Graefe. Merging what's cracked, cracking what's merged: Adaptive indexing in main-memory column-stores. PVLDB, 4(9):585--597, 2011.

Digital Library

[21]

H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1):117--128, 2011.

Digital Library

[22]

E. Keogh, K. Chakrabarti, and M. Pazzani. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3(3):263--286, 2000.

[23]

J. Lin, E. Keogh, and S. Lonardi. A symbolic representation of time series, with implications for streaming algorithms. In DMKD, pages 2--11, 2003.

Digital Library

[24]

T. Palpanas, M. Vlachos, E. J. Keogh, and D. Gunopulos. Streaming time series summarization using user-defined amnesic functions. TKDE, 20(7):992--1006, 2008.

Digital Library

[25]

D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In SIGMOD, pages 13--25, 1997.

Digital Library

[26]

T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, pages 262--270, 2012.

Digital Library

[27]

S. Richter, J.-A. Quiane-Ruiz, S. Schuh, and J. Dittrich. Towards zero-overhead static and adaptive indexing in hadoop. VLDBJ, 2013.

Digital Library

[28]

F. M. Schuhknecht, A. Jindal, and J. Dittrich. The Uncracked Pieces in Database Cracking. PVLDB, 7(2):97--108, 2013.

[29]

J. Shieh and E. Keogh. iSAX: Indexing and Mining Terabyte Sized Time Series. In SIGKDD, pages 623--631, 2008.

Digital Library

[30]

J. Shieh and E. Keogh. iSAX: disk-aware mining and indexing of massive time series datasets. DMKD, 19(1):24--57, 2009.

Digital Library

[31]

M. Stonebraker. The case for partial indexes. SIGMOD Rec., 18(4):4--11, 1989.

Digital Library

[32]

Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB, 6(10):793--804, 2013.

Digital Library

[33]

T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857--1874, 2005.

Digital Library

[34]

B. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary lp norms. In VLDB, pages 385--394, 2000.

Digital Library

[35]

J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In VLDB, pages 405--416, 2003.

Digital Library

[36]

J. Zhou and K. A. Ross. Buffering database operations for enhanced instruction cache performance. In SIGMOD, pages 191--202, 2004.

Digital Library

Cited By

Wei JLee XLiao ZPalpanas TPeng B(2025)Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor SearchProceedings of the ACM on Management of Data10.1145/37097293:1(1-29)Online publication date: 11-Feb-2025
https://dl.acm.org/doi/10.1145/3709729
Wang QIleana IPalpanas T(2025)LeaFi: Data Series Indexes on Steroids with Learned FiltersProceedings of the ACM on Management of Data10.1145/37097013:1(1-27)Online publication date: 11-Feb-2025
https://doi.org/10.1145/3709701
Dawod AGeorgakopoulos DJayaraman PNirmalathas A(2024)A Survey of Techniques for Discovering, Using, and Paying for Third-Party IoT SensorsSensors10.3390/s2408253924:8(2539)Online publication date: 15-Apr-2024
https://doi.org/10.3390/s24082539
Show More Cited By

Index Terms

Indexing for interactive exploration of big data series
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Recommendations

Efficient k-nearest neighbor searching in nonordered discrete data spaces

Numerous techniques have been proposed in the past for supporting efficient k-nearest neighbor (k-NN) queries in continuous data spaces. Limited work has been reported in the literature for k-NN queries in a nonordered discrete data space (NDDS). ...
Towards zero-overhead static and adaptive indexing in Hadoop

Hadoop MapReduce has evolved to an important industry standard for massive parallel data processing and has become widely adopted for a variety of use-cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs ...
Generating data series query workloads

Data series (including time series) has attracted lots of interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Utah, Snowbird, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

83
Total Citations
View Citations
1,139
Total Downloads

Downloads (Last 12 months)18
Downloads (Last 6 weeks)4

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wei JLee XLiao ZPalpanas TPeng B(2025)Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor SearchProceedings of the ACM on Management of Data10.1145/37097293:1(1-29)Online publication date: 11-Feb-2025
https://dl.acm.org/doi/10.1145/3709729
Wang QIleana IPalpanas T(2025)LeaFi: Data Series Indexes on Steroids with Learned FiltersProceedings of the ACM on Management of Data10.1145/37097013:1(1-27)Online publication date: 11-Feb-2025
https://doi.org/10.1145/3709701
Dawod AGeorgakopoulos DJayaraman PNirmalathas A(2024)A Survey of Techniques for Discovering, Using, and Paying for Third-Party IoT SensorsSensors10.3390/s2408253924:8(2539)Online publication date: 15-Apr-2024
https://doi.org/10.3390/s24082539
Hu HQiu JWang HLiang BZou S(2024)DIDS: Double Indices and Double Summarizations for Fast Similarity SearchProceedings of the VLDB Endowment10.14778/3665844.366585117:9(2198-2211)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.14778/3665844.3665851
Yang JJoo HYerramreddy SMoritz DBattle L(2024)Optimizing Dataflow Systems for Scalable Interactive VisualizationProceedings of the ACM on Management of Data10.1145/36392762:1(1-25)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639276
Wang ZWang QWang PPalpanas TWang W(2024)DumpyOS: A data-adaptive multi-ary index for scalable data series similarity searchThe VLDB Journal10.1007/s00778-024-00874-933:6(1887-1911)Online publication date: 21-Aug-2024
https://doi.org/10.1007/s00778-024-00874-9
Liu ZKang CXing X(2023)Querying Similar Multi-Dimensional Time Series with a Spatial DatabaseISPRS International Journal of Geo-Information10.3390/ijgi1204017912:4(179)Online publication date: 21-Apr-2023
https://doi.org/10.3390/ijgi12040179
Chatzakis MFatourou PKosmas EPalpanas TPeng B(2023)Odyssey: A Journey in the Land of Distributed Data Series Similarity SearchProceedings of the VLDB Endowment10.14778/3579075.357908716:5(1140-1153)Online publication date: 6-Mar-2023
https://doi.org/10.14778/3579075.3579087
Wang ZWang QWang PPalpanas TWang W(2023)Dumpy: A Compact and Adaptive Index for Large Data Series CollectionsProceedings of the ACM on Management of Data10.1145/35889651:1(1-27)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588965
Raman ASarkar SOlma MAthanassoulis M(2023)Indexing for Near-Sorted Data2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00117(1475-1488)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00117
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten