research-article

Open access

An Optimal Online Algorithm For Retrieving Heavily Perturbed Statistical Databases In The Low-Dimensional Querying Model

Authors:

Krzysztof Marcin Choromanski,

Afshin Rostamizadeh,

Umar SyedAuthors Info & Claims

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

Pages 1381 - 1390

https://doi.org/10.1145/2806416.2806421

Published: 17 October 2015 Publication History

Abstract

We give the first Õ(1 over √ T)-error online algorithm for reconstructing noisy statistical databases, where T is the number of (online) sample queries received. The algorithm is optimal up to the poly(log(T)) factor in terms of the error and requires only O(log T) memory. It aims to learn a hidden database-vector w* Ε in ℜ ^D in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution D. We assume the distribution D is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in O(dD)-time per query, where d is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set that is used by the algorithm to learn the database --- the stream on which the algorithm will be evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle Ο that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold, which is specified by the algorithm. We note that we allow for a significant O(D) amount of noise to be added while other works focused on the low noise o(√D)-setting. For a stream of T queries our algorithm achieves an average error Õ(1 over √T) by filtering out random noise, adapting threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of a database-vector w* onto the manifold defining the query-space. Our algorithm may be also applied in the adversarial machine learning context to compromise machine learning engines by heavily exploiting the vulnerabilities of the systems that output only binary signal and in the presence of significant noise.

References

[1]

Cynthia Dwork. Differential privacy. In ICALP (2), pages 1--12, 2006.

Digital Library

[2]

Cynthia Dwork. Differential privacy in new settings. In Moses Charikar, editor, SODA, pages 174--183. SIAM, 2010.

Digital Library

[3]

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Leonard J. Schulman, editor, STOC, pages 715--724. ACM, 2010.

Digital Library

[4]

Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Andrew Chi-Chih Yao, editor, ICS, pages 66--80. Tsinghua University Press, 2010.

[5]

Kobbi Nissim, Rann Smorodinsky, and Moshe Tennenholtz. Approximately optimal mechanism design via differential privacy. CoRR, abs/1004.2888, 2010.

[6]

Krzysztof Choromanski and Tal Malkin. The power of the dinur-nissim algorithm: breaking privacy of statistical and graph databases. In PODS, pages 65--76, 2012.

Digital Library

[7]

Sergey Yekhanin. Private information retrieval. Commun. ACM, 53(4):68--73, 2010.

Digital Library

[8]

Cynthia Dwork, Frank McSherry, and Kunal Talwar. The price of privacy and the limits of LP decoding. In David S. Johnson and Uriel Feige, editors, STOC, pages 85--94. ACM, 2007.

Digital Library

[9]

Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202--210. ACM, 2003.

Digital Library

[10]

Cynthia Dwork and Sergey Yekhanin. New efficient attacks on statistical disclosure control mechanisms. In CRYPTO, pages 469--480, 2008.

Digital Library

[11]

Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In STOC, pages 537--546, 2008.

Digital Library

[12]

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373--1396, 2003.

Digital Library

[13]

Richard G. Baraniuk, Volkan Cevher, and Michael B. Wakin. Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective. pages 959--971, 2010.

[14]

Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 99--108, New York, NY, USA, 2004. ACM.

Digital Library

[15]

Yevgeniy Vorobeychik and Bo Li. Optimal randomized classification in adversarial settings. In AAMAS, pages 485--492, 2014.

Digital Library

[16]

Kevin G. Jamieson, Maya R. Gupta, Eric Swanson, and Hyrum S. Anderson. Training a support vector machine to classify signals in a real environment given clean training data. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2010, 14-19 March 2010, Sheraton Dallas Hotel, Dallas, Texas, USA, pages 2214--2217, 2010.

[17]

S. Ross. Stochastic processes. Wiley, 1996.

Index Terms

An Optimal Online Algorithm For Retrieving Heavily Perturbed Statistical Databases In The Low-Dimensional Querying Model
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Recommendations

Query Evaluability in Statistical Databases

The evaluability of queries on a statistical database containing joinable tables connected by an intersection hypergraph is considered. A characterization of evaluable queries is given, which allows one to define polynomial-time procedures both for ...
Usability of Compromise-Free Statistical Databases
SSDBM '97: Proceedings of the Ninth International Conference on Scientific and Statistical Database Management

The usability of a statistical database is defined to be the ratio of the cardinality of the largest set of queries which can be answered without compromise to the total number of queries. In this paper, we present new results concerning the usability ...
Stochastic Protection of Confidential Information in Databases: A Hybrid of Data Perturbation and Query Restriction

Data perturbation and query restriction are two methods developed to protect confidential data in statistical databases. In the former, the data is systematically changed to yield answers to queries that are statistically similar to those that would ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

October 2015

1998 pages

ISBN:9781450337946

DOI:10.1145/2806416

General Chairs:
James Bailey
The University of Melbourne
,
Alistair Moffat
The University of Melbourne
,
Program Chairs:
Charu C. Aggarwal
IBM
,
Maarten de Rijke
University of Amsterdam
,
Ravi Kumar
Google
,
Vanessa Murdock
Microsoft
,
Timos Sellis
RMIT University
,
Jeffrey Xu Yu
Chinese University of Hong Kong

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM'15

Sponsor:

CIKM'15: 24th ACM International Conference on Information and Knowledge Management

October 18 - 23, 2015

Melbourne, Australia

Acceptance Rates

CIKM '15 Paper Acceptance Rate 165 of 646 submissions, 26%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

CIKM '25

Sponsor:
sigir
sigir

The 34th ACM International Conference on Information and Knowledge Management

November 10 - 14, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
257
Total Downloads

Downloads (Last 12 months)34
Downloads (Last 6 weeks)7

Reflects downloads up to 18 Dec 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents