Privacy-Preserving Crowd-Sourcing of Web Searches with Private Data Donor

Published: 13 May 2019

Abstract

Search engines play an important role on the Web, helping users find relevant resources and answers to their questions. At the same time, search logs can be of great utility to researchers. For instance, a number of recent research efforts have relied on them to build prediction and inference models, for applications ranging from economics and marketing to public health surveillance. However, companies rarely release search logs, in part because of the privacy issues involved: such logs are inherently hard to anonymize. As a result, it is very difficult for researchers to gain access to search data, and even when they do, they remain fully dependent on the company providing it. To overcome these issues, this paper presents Private Data Donor (PDD), a decentralized and private-by-design platform providing crowd-sourced Web searches to researchers. We build on a cryptographic protocol for privacy-preserving data aggregation, and address a few practical challenges to keep the system reliable when users disconnect or stop using the platform. We discuss how PDD can be used to build a flu monitoring model, and evaluate the impact of the privacy-preserving layer on the quality of the results. Finally, we present the implementation of our platform, as a browser extension and a server, and report on a pilot deployment with real users.
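
The abstract does not spell out the aggregation protocol, so the following is only a generic illustration of privacy-preserving aggregation, not PDD's actual scheme. It assumes an additive-masking design in which every pair of users shares a random mask that one adds and the other subtracts, so the server sees only blinded values yet recovers the exact total. The names `MODULUS`, `pairwise_masks`, `blind`, and `aggregate` are hypothetical, and the shared masks are drawn locally here where a real system would derive them from a key agreement.

```python
import secrets

MODULUS = 2**32  # hypothetical modulus; all arithmetic is done mod this value


def pairwise_masks(num_users):
    """Build masks[i][j] such that masks[i][j] + masks[j][i] == 0 (mod MODULUS)."""
    masks = [[0] * num_users for _ in range(num_users)]
    for i in range(num_users):
        for j in range(i + 1, num_users):
            # Stand-in for a secret that users i and j would agree on.
            shared = secrets.randbelow(MODULUS)
            masks[i][j] = shared
            masks[j][i] = (-shared) % MODULUS
    return masks


def blind(value, user_masks):
    """Hide one user's value by adding all of that user's pairwise masks."""
    return (value + sum(user_masks)) % MODULUS


def aggregate(blinded_values):
    """Sum the blinded values; the pairwise masks cancel out in the total."""
    return sum(blinded_values) % MODULUS


# Example: four users each report how often they searched for some term.
counts = [3, 0, 5, 1]
masks = pairwise_masks(len(counts))
blinded = [blind(c, masks[i]) for i, c in enumerate(counts)]
total = aggregate(blinded)
```

In a scheme of this kind, the masks cancel only if every user contributes. A single missing contribution leaves the sum blinded, which illustrates why users disconnecting is a practical concern for such protocols.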

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN:9781450366748
DOI:10.1145/3308558

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
