research-article

Overcoming browser cookie churn with clustering

Authors:

Anirban Dasgupta,

Maxim Gurevich,

Achint O. ThomasAuthors Info & Claims

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining

Pages 83 - 92

https://doi.org/10.1145/2124295.2124308

Published: 08 February 2012 Publication History

Abstract

Many large Internet websites are accessed by users anonymously, without requiring registration or logging-in. However, to provide personalized service these sites build anonymous, yet persistent, user models based on repeated user visits. Cookies, issued when a web browser first visits a site, are typically employed to anonymously associate a website visit with a distinct user (web browser). However, users may reset cookies, making such association short-lived and noisy. In this paper we propose a solution to the cookie churn problem: a novel algorithm for grouping similar cookies into clusters that are more persistent than individual cookies. Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems.

We present a novel method to cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns. We address each step of the clustering as a binary classification problem estimating the probability that two different subsets of cookies belong to the same browser. We observe that our clustering problem is a generalized interval graph coloring problem, and propose a greedy heuristic algorithm for solving it. The scalability of this method allows us to cluster hundreds of millions of browser cookies and provides significant improvements over baselines such as constrained K-means.

References

[1]

http://www.comscore.com/Press_Events/Press_ Releases/2007/04/comScore_Cookie_Deletion_Report.

[2]

D. Agarwal and B. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 19--28. ACM, 2009.

Digital Library

[3]

D. Agarwal, L. Li, and A. Smola. Linear-time estimators for propensity scores. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15:93--100, 2011.

[4]

E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461--486, 2009.

Digital Library

[5]

M. Ankerst, M. Breunig, H. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 49--60. ACM, 1999.

Digital Library

[6]

A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 79--85. Association for Computational Linguistics, 1998.

Digital Library

[7]

N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1):89--113, 2004.

Digital Library

[8]

A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13):1157--1166, 1997.

Digital Library

[9]

G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Institut National de Recherche en Informatique et en Automatique, 1991.

[10]

I. Davidson and S. Ravi. Clustering With Constraints: Feasibility Issues and the K-Means Algorithm. In Proceedings of the Fifth SIAM International Conference on Data Mining, page 138. Society for Industrial Mathematics, 2005.

[11]

E. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172--187, 2006.

Digital Library

[12]

A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms. Artificial neural networks in engineering (ANNIE-99), pages 809--814, 1999.

[13]

P. Eckersley. How unique is your web browser? http://panopticlick.eff.org/browser-uniqueness.pdf.

Digital Library

[14]

M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, volume 96, pages 226--231, 1996.

[15]

M. Fomitchev. Method and system for estimating unique visitors for internet sites. United States Patent Application 20100228850.

[16]

M. Golumbic. Algorithmic graph theory and perfect graphs. North-Holland, 2004.

Digital Library

[17]

N. Good, J. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the National Conference on Artificial Intelligence, pages 439--446. JOHN WILEY & SONS LTD, 1999.

Digital Library

[18]

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107--145, 2001.

Digital Library

[19]

M. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf. Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4):18--28, 2002.

Digital Library

[20]

E. Jeábek. Dual weak pigeonhole principle, Boolean complexity, and derandomization. Annals of Pure and Applied Logic, 129(1-3):1--37, 2004.

[21]

R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773--795, 1995.

[22]

B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16--22. ACM, 1999.

Digital Library

[23]

Z. Li and J. Liu. Constrained clustering by spectral kernel learning. In Computer Vision, 2009 IEEE 12th International Conference on, pages 421--427. IEEE, 2010.

[24]

Z. Lu and M. Carreira-Perpinán. Constrained spectral clustering through affinity propagation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1--8. IEEE, 2008.

[25]

J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, page 14. California, USA, 1967.

[26]

P. Mahalanobis. On the generalized distance in statistics. In Proceedings of the National Institute of Science, Calcutta, volume 12, page 49, 1936.

[27]

P. McCullagh and J. Nelder. Generalized linear models. Chapman & Hall, CRC, 1999.

[28]

M. Meila. Comparing clusterings by the variation of information. In Learning theory and Kernel machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003: proceedings, page 173. Springer Verlag, 2003.

[29]

M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge Univ Pr, 2005.

Digital Library

[30]

S. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste. Naïve filterbots for robust cold-start recommendations. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 699--705. ACM, 2006.

Digital Library

[31]

G. Salton and M. McGill. Introduction to modern information retrieval. New York, 1983.

Digital Library

[32]

M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD workshop on text mining, volume 34, page 35. Citeseer, 2000.

[33]

C. Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365--373, 1993.

[34]

K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 577--584. Citeseer, 2001.

Digital Library

[35]

K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207--244, 2009.

Digital Library

[36]

L. Zhang, D. Agarwal, and B. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, pages 13--20. ACM, 2011.

Digital Library

[37]

Y. Zhao and G. Karypis. Criterion functions for document clustering. Citeseer, 2005.

Cited By

Renjith SSreekumar AJathavedan M(2022)Taxonomy grooming algorithm ‐ An autodidactic domain specific dimensionality reduction approach for fast clustering of social media text dataConcurrency and Computation: Practice and Experience10.1002/cpe.683734:11Online publication date: Feb-2022
https://doi.org/10.1002/cpe.6837
Shekhar SPai DRavindran S(2020)Entity Resolution in Dynamic Heterogeneous NetworksCompanion Proceedings of the Web Conference 202010.1145/3366424.3391264(662-668)Online publication date: 20-Apr-2020
https://dl.acm.org/doi/10.1145/3366424.3391264
Jin DHeimann MRossi RKoutra D(2020)node2bits: Compact Time- and Attribute-Aware Node Representations for User StitchingMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-46150-8_29(483-506)Online publication date: 30-Apr-2020
https://doi.org/10.1007/978-3-030-46150-8_29
Show More Cited By

Index Terms

Overcoming browser cookie churn with clustering
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Distributed programming languages

Recommendations

Browser wars: Metaphor, Web browser, Microsoft, Internet Explorer, Netscape, Netscape Navigator, Mozilla Firefox, Google Chrome, Safari (web browser), Opera (web browser), Usage share of web browsers
Cross Cookie: A Cookie Protocol for Web Mashups
ISECS '08: Proceedings of the 2008 International Symposium on Electronic Commerce and Security

In this paper, HTTP cookies, especially when they are used in Web mashups, are studied. A new cookie protocol named Cross Cookie is proposed. In a Cross Cookie, both information of where the cookie is originated and where the cookie is used on client-...
Browser with Clustering of Web Documents
ADCONS '13: Proceedings of the 2013 2nd International Conference on Advanced Computing, Networking and Security

Accessing relevant information quickly, given limited time and space, is a major issue in Web browsers, especially those in mobile devices. In this paper we propose a framework for grouping similar Web documents in a browser based on similar content of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining

February 2012

792 pages

ISBN:9781450307475

DOI:10.1145/2124295

General Chairs:
Eytan Adar
University of Michigan, USA
,
Jaime Teevan
Microsoft Research, USA
,
Program Chairs:
Eugene Agichtein
Emory University, USA
,
Yoelle Maarek
Yahoo! Research, Israel

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 February 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

WSDM'12

Sponsor:

WSDM'12: Fifth ACM International Conference on Web Search and Data Mining

February 8 - 12, 2012

Washington, Seattle, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
741
Total Downloads

Downloads (Last 12 months)13
Downloads (Last 6 weeks)2

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Renjith SSreekumar AJathavedan M(2022)Taxonomy grooming algorithm ‐ An autodidactic domain specific dimensionality reduction approach for fast clustering of social media text dataConcurrency and Computation: Practice and Experience10.1002/cpe.683734:11Online publication date: Feb-2022
https://doi.org/10.1002/cpe.6837
Shekhar SPai DRavindran S(2020)Entity Resolution in Dynamic Heterogeneous NetworksCompanion Proceedings of the Web Conference 202010.1145/3366424.3391264(662-668)Online publication date: 20-Apr-2020
https://dl.acm.org/doi/10.1145/3366424.3391264
Jin DHeimann MRossi RKoutra D(2020)node2bits: Compact Time- and Attribute-Aware Node Representations for User StitchingMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-46150-8_29(483-506)Online publication date: 30-Apr-2020
https://doi.org/10.1007/978-3-030-46150-8_29
Esteller-Cucala MFernandez VVilluendas DPapadopoulos GSamaras GWeibelzahl SJannach DSantos O(2019)Experimentation Pitfalls to Avoid in A/B Testing for Online PersonalizationAdjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization10.1145/3314183.3323853(153-159)Online publication date: 6-Jun-2019
https://dl.acm.org/doi/10.1145/3314183.3323853
Li ZZheng ZWu FChen GAndre EKoenig SDastani MSukthankar G(2018)On Designing Optimal Data Purchasing Strategies for Online Ad AuctionsProceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems10.5555/3237383.3237927(1522-1530)Online publication date: 9-Jul-2018
https://dl.acm.org/doi/10.5555/3237383.3237927
Kim SKini NPujara JKoh EGetoor LBarrett RCummings RAgichtein EGabrilovich E(2017)Probabilistic Visitor Stitching on Cross-Device Web LogsProceedings of the 26th International Conference on World Wide Web10.1145/3038912.3052711(1581-1589)Online publication date: 3-Apr-2017
https://dl.acm.org/doi/10.1145/3038912.3052711
Wang LLee KLu QMukhopadhyay SZhai CBertino ECrestani FMostafa JTang JSi LZhou XChang YLi YSondhi P(2016)Improving Advertisement Recommendation by Enriching User Browser Cookie AttributesProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983374(2401-2404)Online publication date: 24-Oct-2016
https://dl.acm.org/doi/10.1145/2983323.2983374
Coey DBailey MBourdeau JHendler JNkambou RHorrocks IZhao B(2016)People and CookiesProceedings of the 25th International Conference on World Wide Web10.1145/2872427.2882984(1103-1111)Online publication date: 11-Apr-2016
https://dl.acm.org/doi/10.1145/2872427.2882984
Dmitriev PFrasca BGupta SKohavi RVaz G(2016)Pitfalls of long-term online controlled experiments2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7840744(1367-1376)Online publication date: Dec-2016
https://doi.org/10.1109/BigData.2016.7840744
White RAwadallah ABaeza-Yates RLalmas MMoffat ARibeiro-Neto B(2015)Personalizing Search on Shared DevicesProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767736(523-532)Online publication date: 9-Aug-2015
https://dl.acm.org/doi/10.1145/2766462.2767736
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents