Private Information Read-Update-Write with Applications to Distributed Learning

Date

2023

Abstract

Data privacy has gained significant interest in information theory with the recent emergence of a wide range of data-driven technologies. Data privacy is primarily compromised when the data origins (private users) download or upload information. This dissertation focuses on how information-theoretic privacy of users' data can be guaranteed when downloading and uploading information, along the lines of private information retrieval (PIR) and private read-update-write (PRUW), both of which have applications in privacy-preserving distributed learning.

First, we consider the problem of semantic PIR, in which multiple non-colluding databases store a number of files, out of which a user desires to download one without revealing its index to any of the databases. Semantic PIR deviates from classical PIR by allowing the files to have arbitrary semantics, such as different file sizes and arbitrary popularity profiles. As the main result of this work, we characterize the capacity of semantic PIR, with achievable schemes and a converse proof. We provide capacity results for semantic PIR with replicated databases, MDS-coded databases, and colluding databases.
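To fix ideas, the classical replicated-database setting that semantic PIR generalizes can be illustrated with a minimal two-database XOR scheme (this is a standard textbook construction, not the capacity-achieving scheme of this work; all names and sizes below are illustrative):

```python
import secrets

def xor_all(blocks, size):
    # XOR a list of equal-length byte blocks together
    out = bytearray(size)
    for b in blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

K, SIZE = 4, 8                      # 4 files of 8 bytes, replicated in both databases
files = [secrets.token_bytes(SIZE) for _ in range(K)]
theta = 2                           # desired file index, to be hidden from each database

# Query: a uniformly random subset of indices for DB1; the same subset
# with index theta toggled for DB2. Each query alone is uniform, so
# neither database learns theta.
q1 = [secrets.randbelow(2) for _ in range(K)]
q2 = q1.copy()
q2[theta] ^= 1

a1 = xor_all([files[i] for i in range(K) if q1[i]], SIZE)
a2 = xor_all([files[i] for i in range(K) if q2[i]], SIZE)

# The two answers differ exactly by the desired file
recovered = bytes(x ^ y for x, y in zip(a1, a2))
assert recovered == files[theta]
```

Semantic PIR replaces the equal-size, equal-popularity files assumed here with files of arbitrary sizes and popularity profiles, which changes the achievable download cost.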

As PIR deals with private \emph{reading}, we consider private \emph{writing} in the second work, which is an immediate conceptual extension of PIR. In the problem of private read-update-write (PRUW), a user downloads, updates, and writes back a specific section of a storage system while guaranteeing information-theoretic privacy of the index of the updated section and the values of the updates. PRUW has applications in distributed learning, such as privacy-preserving federated learning (FL) with sparsification and private federated submodel learning (FSL). In FSL, a machine learning model is divided into multiple submodels based on the different types of data used for training, and a given user only downloads and updates the submodel relevant to the user's local data. To guarantee the privacy of the users participating in the FSL process, both the index of the updated submodel and the values of the updates must be kept private from the server that stores the model. This is achieved by PRUW, where the required submodel is downloaded in the \emph{reading phase} without revealing the submodel index to the databases (similar to PIR), and the updates are sent back to the databases in the \emph{writing phase} without revealing the values of the updates or the submodel index. We provide a basic PRUW scheme to perform private FSL that achieves the lowest total communication cost known for private FSL thus far. In this work, we introduce the concept of combining multiple parameter updates into a single bit in terms of a Lagrange polynomial, in such a way that it can be privately decomposed into the respective individual updates and added to the relevant positions at the server.
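The core algebraic idea of packing multiple updates into one Lagrange polynomial can be sketched as follows. This is a minimal illustration of Lagrange encoding over a prime field, not the full PRUW scheme; the field size, evaluation points, and update values are hypothetical choices:

```python
P = 2**31 - 1  # illustrative prime field

def lagrange_eval(points, x):
    # Evaluate the unique polynomial through `points` at x, mod P,
    # using the Lagrange interpolation formula.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * ((x - xj) % P) % P
                den = den * ((xi - xj) % P) % P
        # pow(den, P - 2, P) is the modular inverse of den
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

updates = [17, 42, 99]                 # three parameter updates (hypothetical values)
anchors = [(k + 1, u) for k, u in enumerate(updates)]  # encode update k as f(k + 1)

# Each database receives a single evaluation of f at its own point
shares = [(x, lagrange_eval(anchors, x)) for x in (11, 12, 13)]

# Any three evaluations determine the degree-2 polynomial, so the
# individual updates can be decomposed back out by interpolation
recovered = [lagrange_eval(shares, k + 1) for k in range(len(updates))]
assert recovered == updates
```

In the actual scheme, such evaluations are combined with random noise polynomials so that no individual database learns the updates, while the structure still allows the server side to place each decomposed update at its correct position.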

Third, we consider the problem of private FSL with top $r$ sparsification, in which the user-server communications are significantly reduced by only sharing the most significant $r$ fraction of parameters and updates in the reading and writing phases. However, this introduces additional privacy requirements, as the positions of the sparse updates/parameters leak information about the users' private data in addition to their values and the submodel index. To this end, we provide a PRUW scheme that performs top $r$ sparsification in FSL while guaranteeing the information-theoretic privacy of the updating submodel index, the values of the sparse updates, and the positions of the sparse updates/parameters using a permutation technique.
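The position-hiding role of the permutation can be sketched in a few lines. In this toy sketch (a simplification under the assumption of a single random permutation shared among users but hidden from the databases; the model size and positions are illustrative), the server only ever sees permuted coordinates, yet can still apply updates because the storage itself is arranged in permuted order:

```python
import random

MODEL_SIZE = 10

# A random permutation known to the users but hidden from the databases
perm = list(range(MODEL_SIZE))
random.shuffle(perm)
inv = [0] * MODEL_SIZE
for pos, p in enumerate(perm):
    inv[p] = pos

# The user's most significant (top r) update positions -- private data
sparse_positions = [2, 5, 9]

# Positions revealed to the server are permuted, hiding the true coordinates
uploaded = sorted(perm[i] for i in sparse_positions)

# Since storage is kept in permuted order, the server applies updates at
# `uploaded` directly; only a permutation-holder can map them back
restored = sorted(inv[j] for j in uploaded)
assert restored == sparse_positions
```

Because the server observes a uniformly permuted set of positions, the revealed coordinates are statistically independent of the true sparse support, which is what makes the information-theoretic position-privacy guarantee possible.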

Fourth, we study random sparsification in FSL, in which the user only downloads and uploads a specific fraction of randomly selected parameters and updates to reduce the communication cost. The problem is formulated as a rate-distortion characterization, where we derive the minimum achievable communication cost for a given amount of allowed distortion. We show that a linear rate-distortion relation is achievable while guaranteeing the information-theoretic privacy of the updating submodel index, the values of the sparse updates, and the positions of the sparse updates/parameters.

Fifth, we extend the ideas of PRUW to FL with top $r$ sparsification. While the permutation technique introduced for FSL with top $r$ sparsification is applicable to this setting, it incurs a significantly larger storage cost in FL. To alleviate this, we modify the permutation technique using a model segmentation mechanism, in such a way that the storage cost is reduced at the expense of a certain amount of information leakage. In general, we provide the trade-off between the communication cost, storage cost, and information leakage in private FL with top $r$ sparsification, along with achievable schemes with different properties.

In all of the above PRUW settings, we require multiple non-colluding databases to store the central model to guarantee information-theoretic privacy of the users' local data. In the sixth work, we consider the practical scenario of private FSL, where the databases storing the central model are allowed to have arbitrary storage constraints. As the main result of this work, we develop a PRUW scheme and a storage mechanism for FSL that efficiently utilize the available space in each database to store the submodel parameters in such a way that the total communication cost is minimized while guaranteeing information-theoretic privacy of the updating submodel index and the values of the updates. The proposed storage mechanism is a hybrid of MDS-coded storage and divided storage, which focuses on finding the optimum MDS codes and fractions of submodels stored in each database for any given set of homogeneous or heterogeneous storage constraints.

Seventh, we go beyond privacy and consider deception in information retrieval. We introduce the problem of deceptive information retrieval (DIR), which is a conceptual extension of PIR. In DIR, the user downloads a required file out of multiple files stored in a system of databases, while revealing to the databases information about a different file than the one actually required. In other words, the user deceives the databases by making their prediction of the user-required file index incorrect with high probability. We propose a scheme that performs DIR at a given required level of deception while minimizing the download cost. The proposed scheme incurs higher download costs than PIR for positive levels of deception, and achieves the PIR capacity when the specified level of deception is zero.
