DOI: 10.1145/2806416.2806440
research-article

Fast Distributed Correlation Discovery Over Streaming Time-Series Data

Published: 17 October 2015

Abstract

The dramatic rise of time-series data in a variety of contexts, such as social networks, mobile sensing, and data centre monitoring, has fuelled interest in obtaining real-time insights from such data using distributed stream processing systems. One extremely valuable insight is the discovery of correlations in real time from large-scale time-series data. A key challenge in discovering correlations is that the number of time-series pairs that have to be analyzed grows quadratically in the number of time series, giving rise to a quadratic increase in both computation cost and communication cost between the cluster nodes in a distributed environment. To tackle this challenge, we propose a framework called AEGIS. AEGIS exploits well-established statistical properties to dramatically prune the number of time-series pairs that have to be evaluated for detecting interesting correlations. Our extensive experimental evaluations on real and synthetic datasets establish the efficacy of AEGIS over baselines.
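
To make the quadratic growth concrete, the sketch below counts the unordered pairs among n time series and runs the kind of exhaustive all-pairs check that a pruning framework would aim to avoid. It is only an illustration of the baseline cost, not the AEGIS method: the abstract does not say which statistical properties AEGIS uses, and the function names (`pair_count`, `pearson`, `naive_correlated_pairs`), the choice of Pearson correlation, and the threshold parameter are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pair_count(n: int) -> int:
    """Number of unordered pairs among n time series: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (an assumption made here
    to avoid dividing by zero; other conventions are possible).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def naive_correlated_pairs(series, threshold):
    """Hypothetical baseline: evaluate every pair, keep those with |r| >= threshold."""
    hits = []
    for (i, x), (j, y) in combinations(enumerate(series), 2):
        r = pearson(x, y)
        if abs(r) >= threshold:
            hits.append((i, j, r))
    return hits
```

For instance, `pair_count(1000)` is 499500 and `pair_count(2000)` is 1999000, so doubling the number of series roughly quadruples the work of the exhaustive check. In a distributed setting each evaluated pair may also require moving one series' window to the node holding the other, which is why the communication cost grows at the same rate.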

Published In

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
October 2015
1998 pages
ISBN: 9781450337946
DOI: 10.1145/2806416

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate algorithm
  2. distributed computing
  3. stream processing
  4. time series analysis

Funding Sources

  • OpenSenseII project of Nano-Tera.ch

Acceptance Rates

CIKM '15 Paper Acceptance Rate: 165 of 646 submissions, 26%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
