Abstract
Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. These make it particular significant approximately detecting duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows. On the DBF basis we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound of false positive rate. For a given space G bits and sliding window size W, our algorithm has an amortized time complexity of \( O + (\sqrt{G/W}) \). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Babcock B, Babu S, Datar M, Windom J. Model and issues in data stream systems. In Proc. Principles of Database Systems (PODS), Wisconsin, USA, June 2002, pp.1–16.
Golab L, Ozsu M T. Issues in data stream management. SIGMOD Record, June 2003, 32(2): 5–14.
Metwally A, Agrawal D, Abbadi A E. Duplicate detection in click streams. In Proc. 14th Int. Conf. World Wide Web, Chiba, Japan, May 2005, pp.12–21.
Deng F, Rafiei D. Approximately detecting duplicates for streaming data using stable Bloom filters. In Proc. ACM SIGMOD Conf., New York, USA, June 2006, pp.25–36.
Anupam V, Mayer A, Nissim K, Pinkas B, Reiter M. On the security of pay-per-click and other web advertising schemes. In Proc. 8th Int. Conf. World Wide Web, Toronto, Canada, May 1999, pp.1091–1100.
Heydon A, Najork M Mercator: A scalable, extensible web crawler. In Proc. 8th Int. Conf. World Wide Web, Toronto, Canada, May 1999, pp.219–229.
Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawling. In Proc. 12th Int. Conf. World Wide Web, Budapest, Hungary, May 2003, pp.680–689.
Cisco network accounting services white paper. Cisco System Inc, 2002, pp.1–20, http://www.cisco.com/warp/public/cc/pd/iosw/prodlit/nwactwp.pdf.
Garcia-Molina H, Ullman J D, Widom J. Database System Implementation. Upper Saddle River: Prentice Hall, Inc., NJ, USA, 1999.
Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. ACM SIGKDD Conf., Washington DC, USA, August 2003, pp.39–48.
Weis M, Naumann F. Dogmatix tracks down duplicates in XML. In Proc. ACM SIGMOD Conf., Baltimore, Maryland, USA, June 2005, pp.431–442.
Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In Proc. 28th Int. Conf. Very Large Data Bases, Hong Kong, China, 2002, pp.586–597.
Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. In Proc. 21st Int. Conf. Data Engineering (ICDE), Tokyo, Japan, 2005, pp.865–876.
Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, July 1970, 13(7): 422–426.
Fan L, Cao P, Almeida J, Broder A Z. Summary cache: A scalable wide-area Web cache sharing protocol. IEEE/ACM Trans. Networking, 2000, 8(3): 281–293.
Mitzenmacher M. Compressed Bloom filters. In Proc. 20th ACM Symposium on Principles of Distributed Computing Netw. Rhode Island, 2001, pp.144–150.
Cohen S, Matias Y. Spectral Bloom filters. In Proc. ACM SIGMOD Conf., California, USA, June 2003, pp.241–252.
Whang K, Zenden B T V, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 1990, 15(2): 208–299.
Cheng K, Xiang L, Iwaihara M. Time-decaying Bloom filters for data streams with skewed distributions. In Proc. 15th Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA), Tokyo, Japan, 2005, pp.63–69.
Cohen E, Strauss M. Maintaining time-decaying stream aggregates. In Proc. Principles of Database Systems (PODS), San Diego, California, USA, June 2003, pp.223–233.
Cormode G, Tirthapura S, Xu B. Time-decaying sketches for sensor data aggregation. In Proc. Principles of Distributed Computing (PODC), Portland, Oregon, USA, May 2007, pp.215–224.
Arasu A, Manku G. Approximate counts and quanlites over sliding windows. In Proc. Principles of Database Systems (PODS), Paris, France, June 2004, pp.286–296.
Author information
Authors and Affiliations
Corresponding author
Additional information
This work is supported by the “Hundred Talents Program” of CAS and the National Natural Science Foundation of China under Grant No. 60772034.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Shen, H., Zhang, Y. Improved Approximate Detection of Duplicates for Data Streams Over Sliding Windows. J. Comput. Sci. Technol. 23, 973–987 (2008). https://doi.org/10.1007/s11390-008-9192-1
Received:
Revised:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-008-9192-1