Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2124295.2124308acmconferencesArticle/Chapter ViewAbstractPublication PageswsdmConference Proceedingsconference-collections
research-article

Overcoming browser cookie churn with clustering

Published: 08 February 2012 Publication History

Abstract

Many large Internet websites are accessed by users anonymously, without requiring registration or logging-in. However, to provide personalized service these sites build anonymous, yet persistent, user models based on repeated user visits. Cookies, issued when a web browser first visits a site, are typically employed to anonymously associate a website visit with a distinct user (web browser). However, users may reset cookies, making such association short-lived and noisy. In this paper we propose a solution to the cookie churn problem: a novel algorithm for grouping similar cookies into clusters that are more persistent than individual cookies. Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems.
We present a novel method to cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns. We address each step of the clustering as a binary classification problem estimating the probability that two different subsets of cookies belong to the same browser. We observe that our clustering problem is a generalized interval graph coloring problem, and propose a greedy heuristic algorithm for solving it. The scalability of this method allows us to cluster hundreds of millions of browser cookies and provides significant improvements over baselines such as constrained K-means.

References

[1]
http://www.comscore.com/Press_Events/Press_ Releases/2007/04/comScore_Cookie_Deletion_Report.
[2]
D. Agarwal and B. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 19--28. ACM, 2009.
[3]
D. Agarwal, L. Li, and A. Smola. Linear-time estimators for propensity scores. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15:93--100, 2011.
[4]
E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461--486, 2009.
[5]
M. Ankerst, M. Breunig, H. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 49--60. ACM, 1999.
[6]
A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 79--85. Association for Computational Linguistics, 1998.
[7]
N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1):89--113, 2004.
[8]
A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13):1157--1166, 1997.
[9]
G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Institut National de Recherche en Informatique et en Automatique, 1991.
[10]
I. Davidson and S. Ravi. Clustering With Constraints: Feasibility Issues and the K-Means Algorithm. In Proceedings of the Fifth SIAM International Conference on Data Mining, page 138. Society for Industrial Mathematics, 2005.
[11]
E. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172--187, 2006.
[12]
A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms. Artificial neural networks in engineering (ANNIE-99), pages 809--814, 1999.
[13]
P. Eckersley. How unique is your web browser? http://panopticlick.eff.org/browser-uniqueness.pdf.
[14]
M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, volume 96, pages 226--231, 1996.
[15]
M. Fomitchev. Method and system for estimating unique visitors for internet sites. United States Patent Application 20100228850.
[16]
M. Golumbic. Algorithmic graph theory and perfect graphs. North-Holland, 2004.
[17]
N. Good, J. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the National Conference on Artificial Intelligence, pages 439--446. JOHN WILEY & SONS LTD, 1999.
[18]
M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107--145, 2001.
[19]
M. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf. Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4):18--28, 2002.
[20]
E. Jeábek. Dual weak pigeonhole principle, Boolean complexity, and derandomization. Annals of Pure and Applied Logic, 129(1-3):1--37, 2004.
[21]
R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773--795, 1995.
[22]
B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16--22. ACM, 1999.
[23]
Z. Li and J. Liu. Constrained clustering by spectral kernel learning. In Computer Vision, 2009 IEEE 12th International Conference on, pages 421--427. IEEE, 2010.
[24]
Z. Lu and M. Carreira-Perpinán. Constrained spectral clustering through affinity propagation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1--8. IEEE, 2008.
[25]
J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, page 14. California, USA, 1967.
[26]
P. Mahalanobis. On the generalized distance in statistics. In Proceedings of the National Institute of Science, Calcutta, volume 12, page 49, 1936.
[27]
P. McCullagh and J. Nelder. Generalized linear models. Chapman & Hall, CRC, 1999.
[28]
M. Meila. Comparing clusterings by the variation of information. In Learning theory and Kernel machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003: proceedings, page 173. Springer Verlag, 2003.
[29]
M. Mitzenmacher and E. Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge Univ Pr, 2005.
[30]
S. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste. Naïve filterbots for robust cold-start recommendations. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 699--705. ACM, 2006.
[31]
G. Salton and M. McGill. Introduction to modern information retrieval. New York, 1983.
[32]
M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD workshop on text mining, volume 34, page 35. Citeseer, 2000.
[33]
C. Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365--373, 1993.
[34]
K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 577--584. Citeseer, 2001.
[35]
K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207--244, 2009.
[36]
L. Zhang, D. Agarwal, and B. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, pages 13--20. ACM, 2011.
[37]
Y. Zhao and G. Karypis. Criterion functions for document clustering. Citeseer, 2005.

Cited By

View all
  • (2022)Taxonomy grooming algorithm ‐ An autodidactic domain specific dimensionality reduction approach for fast clustering of social media text dataConcurrency and Computation: Practice and Experience10.1002/cpe.683734:11Online publication date: Feb-2022
  • (2020)Entity Resolution in Dynamic Heterogeneous NetworksCompanion Proceedings of the Web Conference 202010.1145/3366424.3391264(662-668)Online publication date: 20-Apr-2020
  • (2020)node2bits: Compact Time- and Attribute-Aware Node Representations for User StitchingMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-46150-8_29(483-506)Online publication date: 30-Apr-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
February 2012
792 pages
ISBN:9781450307475
DOI:10.1145/2124295
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 February 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. bayes factor
  2. browser cookie churn
  3. clustering algorithms
  4. distributed computing
  5. similarity measure

Qualifiers

  • Research-article

Conference

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)13
  • Downloads (Last 6 weeks)2
Reflects downloads up to 13 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Taxonomy grooming algorithm ‐ An autodidactic domain specific dimensionality reduction approach for fast clustering of social media text dataConcurrency and Computation: Practice and Experience10.1002/cpe.683734:11Online publication date: Feb-2022
  • (2020)Entity Resolution in Dynamic Heterogeneous NetworksCompanion Proceedings of the Web Conference 202010.1145/3366424.3391264(662-668)Online publication date: 20-Apr-2020
  • (2020)node2bits: Compact Time- and Attribute-Aware Node Representations for User StitchingMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-46150-8_29(483-506)Online publication date: 30-Apr-2020
  • (2019)Experimentation Pitfalls to Avoid in A/B Testing for Online PersonalizationAdjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization10.1145/3314183.3323853(153-159)Online publication date: 6-Jun-2019
  • (2018)On Designing Optimal Data Purchasing Strategies for Online Ad AuctionsProceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems10.5555/3237383.3237927(1522-1530)Online publication date: 9-Jul-2018
  • (2017)Probabilistic Visitor Stitching on Cross-Device Web LogsProceedings of the 26th International Conference on World Wide Web10.1145/3038912.3052711(1581-1589)Online publication date: 3-Apr-2017
  • (2016)Improving Advertisement Recommendation by Enriching User Browser Cookie AttributesProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983374(2401-2404)Online publication date: 24-Oct-2016
  • (2016)People and CookiesProceedings of the 25th International Conference on World Wide Web10.1145/2872427.2882984(1103-1111)Online publication date: 11-Apr-2016
  • (2016)Pitfalls of long-term online controlled experiments2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7840744(1367-1376)Online publication date: Dec-2016
  • (2015)Personalizing Search on Shared DevicesProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767736(523-532)Online publication date: 9-Aug-2015
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media