Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/773153.773176acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

Maintaining variance and k-medians over data stream windows

Published: 09 June 2003 Publication History

Abstract

The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model---maintaining variance and maintaining a k--median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O(1/ε2 log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k--median solution for the last N data points using O(k/τ4 N log2 N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2O(1/τ)).

References

[1]
V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. "Local Search Heuristics for k--median and Facility Location Problems." In Proc. 33rd ACM Symp. on Theory of Computing (STOC), 2001, pages 21--29.
[2]
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. "Models and Issues in Data Stream Systems." In Proceedings of the 21st ACM Symposium on Principles of Databases Systems (PODS), 2002, pages 1--16.
[3]
B. Babcock, M. Datar, and R. Motwani. "Sampling from a Moving Window over Streaming Data." In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 633--634.
[4]
P. S. Bradley, U. M. Fayyad, and C. Reina. "Scaling clustering algorithms to large databases." In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), 1998, pages 9--15.
[5]
M. Charikar and S. Guha. "Improved Combinatorial Algorithms for the Facility Location and k--median Problems." In Proc. of the 40th Annual IEEE Symposium on Foundations of Computer Science(FOCS), 1999, pages 378--388.
[6]
M. Charikar, L. O'Callaghan, and R. Panigrahy. "Better Streaming Algorithms for Clustering Problems." In Proc. of 35th ACM Symposium on Theory of Computing (STOC), 2003.
[7]
M. Datar, A. Gionis, P. Indyk, and R. Motwani. "Maintaining Stream Statistics over Sliding Windows." In Proceedings of Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 635--644.
[8]
P. Domingos and G. Hulten. "Mining High-speed Data Streams." In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD), 2000, pages 71--80.
[9]
P. Domingos, G. Hulten, and L. Spencer. "Mining Time-changing Data Streams." In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pages 97--106.
[10]
P. Gibbons, and S. Tirthapura. "Distributed Streams Algorithms for Sliding Windows" In Proc. of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2002.
[11]
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. "Surfing Wavelets on Streams: One-pass Summaries for Approximate Aggregate Queries." In Proc. 27th Conf. on Very Large Data Bases (VLDB), 2001, pages 79--88.
[12]
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. "Clustering Data Streams." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 359--366.
[13]
S. Guha, R. Rastogi, and K. Shim. "CURE: An efficient clustering algorithm for large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1998, pages 73--84.
[14]
M.R. Henzinger, P. Raghavan, and S. Rajagopalan "Computing on Data Streams.", Technical Report 1998--011, Compaq Systems Research Center, Palo Alto, CA, May, 1998.
[15]
P. Indyk. "Sublinear Time Algorithms for Metric Space Problems." In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), 1999, pages 428--434.
[16]
K. Jain and V. Vazirani. "Primal-Dual Approximation Algorithms for Metric Facility Location and k--median Problems." In Proc. 40th IEEE Symp. on Foundations of Computer Science (FOCS), 1999, pages 1--10.
[17]
R. R. Mettu and C. G. Plaxton. "The Online k--median Problem." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 339--348.
[18]
N. Mishra, D. Oblinger, and L. Pitt. "Sublinear Time Approximate Clustering." In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2001, pages 439--447.
[19]
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. "Streaming-Data Algorithms for High-Quality Clustering." In Proceedings of the 18th Annual IEEE International Conference on Data Engineering (ICDE), 2001.
[20]
C. R. Palmer and C. Faloutsos. "Density biased sampling: an improved method for data mining and clustering." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 2000, pages 82--92.
[21]
D. Pelleg and A. W. Moore. "Accelerating exact k--means algorithms with geometric reasoning." In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), 1999.
[22]
T. Zhang, R. Ramakrishnan, and M. Livny. "BIRCH: an efficient data clustering method for very large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1996, pages 103--114.

Cited By

View all
  • (2023)Near-optimal k-clustering in the sliding window modelProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667116(22934-22960)Online publication date: 10-Dec-2023
  • (2022)Review of Works Content Analyzer for Information Leakage Detection and Prevention in Android Smart DevicesABUAD International Journal of Natural and Applied Sciences10.53982/aijnas.2022.0201.02-j2:1(12-28)Online publication date: 30-Mar-2022
  • (2022)Data Stream ManagementundefinedOnline publication date: 24-Feb-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2003
291 pages
ISBN:1581136706
DOI:10.1145/773153
  • Conference Chair:
  • Frank Neven,
  • General Chair:
  • Catriel Beeri,
  • Program Chair:
  • Tova Milo
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2003

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS03

Acceptance Rates

PODS '03 Paper Acceptance Rate 27 of 136 submissions, 20%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)19
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Near-optimal k-clustering in the sliding window modelProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667116(22934-22960)Online publication date: 10-Dec-2023
  • (2022)Review of Works Content Analyzer for Information Leakage Detection and Prevention in Android Smart DevicesABUAD International Journal of Natural and Applied Sciences10.53982/aijnas.2022.0201.02-j2:1(12-28)Online publication date: 30-Mar-2022
  • (2022)Data Stream ManagementundefinedOnline publication date: 24-Feb-2022
  • (2021)A Clustering Algorithm in Stream Data Using Strong CoresetJournal of Interconnection Networks10.1142/S021926592143011822:Supp02Online publication date: 6-Dec-2021
  • (2021)Statistical Analysis of Pairwise ConnectivityDiscovery Science10.1007/978-3-030-88942-5_11(138-148)Online publication date: 11-Oct-2021
  • (2021)Real-Time News Grouping: Detecting the Same-Content News on Turkish News StreamProgress in Intelligent Decision Science10.1007/978-3-030-66501-2_2(14-25)Online publication date: 30-Jan-2021
  • (2020)Density-Based Clustering Method for Trends Analysis Using Evolving Data StreamInternational Journal of Synthetic Emotions10.4018/IJSE.202007010211:2(19-36)Online publication date: 1-Jul-2020
  • (2020)Detection of Driving Capability Degradation for Human-Machine Cooperative DrivingSensors10.3390/s2007196820:7(1968)Online publication date: 1-Apr-2020
  • (2020)A Stream Clustering Algorithm of Self-Adaptive Method2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications( AEECA)10.1109/AEECA49918.2020.9213515(781-786)Online publication date: Aug-2020
  • (2020)Driving Capability-Based Transition Strategy for Cooperative Driving: From Manual to AutomaticIEEE Access10.1109/ACCESS.2020.30126718(139013-139022)Online publication date: 2020
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media