1. Introduction
Federated learning (FL) is a new paradigm of artificial intelligence in which machines learn collectively without compromising the privacy of the data owners [1]. In FL, only model updates are shared with the central server for aggregation. In contrast, the traditional machine learning (ML) approach requires all training data to be gathered in one place, which increases the risk of private and confidential data being leaked [2]. In the Internet era, where much of this data is online, protecting personal information has become more challenging [3,4]. Hence, FL was proposed in [5]. Decentralized clients train on their own devices, only the updated gradients are sent to the server for aggregation, and the aggregated model is then redistributed to all clients. Thus, the server never sees the actual data. The goal of FL is to allow clients to perform ML on decentralized data without compromising their privacy [6].
Although FL promises data secrecy (because only updates are shared over the network), new security issues associated with this learning model may hinder its widespread acceptance [7,8]. Hence, it is crucial to identify the new risks and limitations involved. For example, an attacker who knows a previous model may monitor traffic and extract personal information about the end-user, or raw data may leak during communication. Such attacks on personal data constitute security breaches and compromise privacy. Figure 1 shows the likelihood of privacy breaches at various stages of FL. Table 1 lists the symbols used throughout the paper.
FL uses a decentralized network that is prone to various attacks (e.g., attacks on user privacy, denial-of-service, inference, and data poisoning) [9]. Moreover, if the network is compromised, an adversary can generate a large amount of traffic to disturb the communication between a particular client and the server, perform a man-in-the-middle attack to block data transmission from client to server, or impose other restrictions that prevent the model from learning. If an attacker controls the model, they can infer the training data and use it for personal advantage, bias the model so that its inferences favor them, or corrupt the model training and thereby damage the model. Hence, security, especially privacy, is the most crucial issue in FL. If user or data privacy is not ensured, the server can collect sensitive information via an inference attack, and private data can be reconstructed by inverting the updated gradients (https://www.kdnuggets.com/2020/08/breaking-privacy-federated-learning.html, accessed on 5 June 2024). Hence, the security of the data and the privacy of their source must be ensured, even in FL.
Three popular techniques are available to provide privacy in FL: secure aggregation (SA) (using masking), differential privacy (DP), and homomorphic encryption (HE) [10,11,12]. SA is very costly (as explained later in this article); DP adds noise to the data, thereby reducing accuracy; and HE has relatively high complexity, raising concerns about the feasibility of its real-time implementation. The properties of these schemes are summarized in Table 2, which compares their advantages and disadvantages.
Group signatures were first introduced to FL in 2021 to solve the issue of privacy preservation for users in the network [13]. The main purpose of using a group signature (GS) is to preserve the sender’s privacy. The idea is to hide the individual signature so that the receiver can verify that the message comes from a legitimate source without explicitly knowing its identity [14]. The GS must be changed if membership is revoked or a member willingly leaves the group; otherwise, the former member can misuse the GS to their advantage [15]. The membership table is updated with a list of current members and ex-members. Only authenticated members can sign a message using the GS; thus, the GS maintains privacy with proper authentication. This achieves the main aim of FL: preserving the user’s privacy. If the user’s identity is not revealed to the receiver, personal information linked to the user can be protected.
In this study, we aim to enhance the privacy of clients in the FL model using GS while also protecting their updates using SA. In conventional SA, each client shares secret information with all other participating clients in the network, and the server aggregates these secrets together with the client updates; all secret keys cancel each other out during the SA process. Each client must communicate with every other client selected for the FL process, which significantly increases the computation and communication costs. Signaling overheads also increase significantly, complicating the entire information-collection process. As indicated in [12], this basic SA model is not scalable.
Therefore, we adopt a GS-based SA technique in which the group admin shares a secret key with the participating clients while distributing the GS. Hence, the most trusted party, the admin, must compute the secret key only once. This reduces the computational cost at each client and the communication cost of sending messages to every other client on the network. After adding their secret keys to their updates, clients sign the updates with the GS and send them to the server. When the server receives a signed message, it verifies the GS and aggregates the secured message.
The group signature scheme has its own limitations. The size of the signature depends significantly on the size of the group: as the group grows, the GS size grows, and the complexity of verification and key management also increases. However, this increase in complexity is still very small compared with the growth in complexity of the SA algorithm, in which adding one member incurs (n + 1) additional communications and forces every client to compute an additional secret-key share for the new member. Several factors, explained later in this article, show that SA is not scalable. Moreover, the increase in signature size depends on the growth of the group, not on how many clients are selected for the FL process, whereas an increase in the number of members in the FL process affects the performance of SA significantly. Hence, GS performs better than the existing algorithms even when the group size increases, as explained later in the performance analysis in Section 5.
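To make this scalability argument concrete, the following illustrative Python accounting (our own sketch, not taken from the cited papers) counts the messages exchanged per FL round under two simplifying assumptions: in pairwise-masking SA, each of the n selected clients exchanges a key share with every other client and then sends one masked update to the server; in the GS-based scheme, the admin sends one key/signature message to each client, and each client sends one signed update to the server.

def sa_messages(n: int) -> int:
    """Pairwise-masking SA (illustrative count): n*(n-1) client-to-client
    key-share messages plus n masked updates sent to the server."""
    return n * (n - 1) + n

def gs_messages(n: int) -> int:
    """GS-based SA (illustrative count): n admin-to-client key messages
    plus n signed, masked updates sent to the server."""
    return n + n

for n in (10, 100, 1000):
    print(n, sa_messages(n), gs_messages(n))   # quadratic vs. linear growth

For n = 100, this toy count already gives 10,000 messages for the pairwise scheme versus 200 for the group-based one.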
To validate the secrecy, confidentiality, and other security aspects of the method, we verify it using a mathematical logic tool, Burrows–Abadi–Needham (BAN) logic. BAN logic operates on a set of rules that we define and helps determine the likelihood of attacks on our algorithm. We also simulate our algorithm using the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool [16] and the High-Level Protocol Specification Language (HLPSL). AVISPA is a robust tool that uses HLPSL to specify algorithms and their security properties, and it integrates different backend tools to provide automated validation. We also compare our results with those of existing algorithms in terms of cost, size, and security.
The contributions of this study are outlined below:
Protection of data and user privacy: We introduce a novel SA with GS to provide privacy to clients and their data. In contrast to algorithms that address only data privacy or only user privacy, this is the first work to address data security and user privacy simultaneously. The server cannot learn the identity of a user because of the GS, and it cannot infer the previous sequence of updates because of the SA. Our work combines the best features of both technologies to produce an algorithm that completely mitigates inference attacks.
Improved computation and communication efficiency: Our algorithm has much lower computation and communication costs than existing algorithms. This is achieved by eliminating the need for peer-to-peer communication during key exchanges in the existing SA model. We provide a detailed analysis of the computation, communication, and signaling costs via comparisons with the existing algorithms.
Security proof of FL algorithm using BAN logic and AVISPA: We provide a formal mathematical proof using BAN logic and simulation analyses using AVISPA.
This paper is based on the assumption that the group admin is a trusted entity, like a certificate authority. This assumption is commonly accepted in group signature-based research. Clients share their keys with the admin to obtain the necessary group keys. The server also communicates with the admin to verify the latest group signature. Hence, the admin is an important entity that is assumed to be trusted and available during the entire FL process.
In Section 1, we introduce our work, the motivation behind it, and our contribution to preserving the privacy of users and data in FL. Table 1 defines all the symbols used throughout the paper. The remainder of this paper is organized as follows: Section 2 presents a literature survey of FL and the techniques proposed by various authors. The methodology and the proposed algorithm are presented in Section 3. Section 4 presents an analysis of the proposed algorithm’s security, together with a security and scope comparison. Section 5 compares the computation and communication costs of our algorithm with those of the existing algorithms. Section 6 concludes the paper and outlines the scope for future research in this area.
2. Related Work
McMahan et al. [5] proposed FedAvg as a method for aggregating the learned gradients collected from various systems. It primarily focuses on efficiency through the availability of clients and limited communications. However, it suffers from high computational and communication costs because of client-to-client communication. The major acknowledged work on privacy preservation in FL is the SA proposed in [12]. This approach uses shared secret keys to aggregate client data and thereby prevent privacy violations. In the first two rounds (the preparation phase), shared secret keys are generated. In the next round, the devices send cryptographically masked updates, and the server accumulates their sum; all participating device updates are included. In the final round, the devices reveal sufficient secret information from the shared secret keys to allow the server to unmask the aggregate model. Only devices that have sent their data need to complete this round. This SA is slightly more costly than normal aggregation; moreover, its costs increase significantly with the number of users, which limits the maximum number of users. SA ensures that the server knows only that one of the users wrote the data, without confirming which user wrote them. The main aim of this SA is to prevent users from revealing their identities to the server. The major disadvantages of this method are as follows (a minimal sketch of the pairwise-masking idea is given after the list):
The need for client-to-client communication increases the overall cost.
Each member participating in masking must create and send a share of its secret information to every other member. If fewer than the minimum required number of shares are available, unmasking cannot be performed.
If a certain number of members leave the FL process, unmasking is difficult and costly to achieve.
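The following minimal sketch (our illustration, using numpy, a toy modulus, and a per-pair PRG seed in place of the Diffie–Hellman key agreement actually used in [12]) shows why the pairwise masks cancel when the server sums all masked updates, and why every pair of clients must first agree on shared secret material:

import numpy as np

MOD = 2**16          # toy modulus for masked arithmetic
DIM = 4              # toy model dimension
n_clients = 5
rng = np.random.default_rng(0)

# True local updates (toy integer-encoded vectors).
updates = [rng.integers(0, 100, DIM) for _ in range(n_clients)]

# Every pair (i, j) shares a seed; in the real protocol this comes from a
# Diffie-Hellman exchange, i.e. n*(n-1)/2 pairwise agreements per round.
def pair_mask(i, j):
    seed = hash((min(i, j), max(i, j))) & 0xFFFFFFFF
    return np.random.default_rng(seed).integers(0, MOD, DIM)

def masked_update(i):
    y = updates[i].copy()
    for j in range(n_clients):
        if j == i:
            continue
        m = pair_mask(i, j)
        y = (y + m) % MOD if i < j else (y - m) % MOD   # + for i < j, - for i > j
    return y

aggregate = sum(masked_update(i) for i in range(n_clients)) % MOD
assert np.array_equal(aggregate, sum(updates) % MOD)    # masks cancel in the sum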
The literature survey below offers an introduction to the existing FL algorithms. Most of these are based on client-to-client secret-sharing methods to protect privacy. In contrast, we propose a group-based privacy-preservation algorithm that is highly cost-effective. Grouping in FL is not a new process: various methods have been proposed, including stratified/cluster-based sampling groups [17], fork-merge-consolidate [18], the FedEval model [1], and group knowledge transfer [19]. However, most of these methods use groups to categorize and collect data rather than to provide security. Hence, we adopt the existing group-based approach and integrate it with the abovementioned SA to offer privacy efficiently. Other methods that offer modified SA and other security algorithms for the FL process are detailed below:
FedEval: A benchmark system with a comprehensive evaluation model for FL [1] (2020): FedEval is a lightweight, easy-to-evaluate federated benchmark system. Its designers customized the traditional message-upload methods, the server-side aggregation and training method, and the incorporation of the global model into the local models. The authors describe FedSGD and FedAvg as the two most common methods in FL. FedSGD operates via large-batch synchronous stochastic gradient descent, whereas FedAvg operates on small batches, training over several rounds and increasing the number of clients in each round; FedAvg is therefore the more efficient method. However, it is less accurate than FedSGD and requires separate validation steps after uploading the actual updates, which increases the number of communications required. The algorithm protects the data from gradient attacks using SA as its base; hence, it suffers from the efficiency issues discussed at the beginning of this section.
Federated learning with differential privacy: Algorithms and performance analysis [11] (2020): This approach designs a privacy-preserving FL architecture by adding an optimized amount of noise to the data. The more noise is added, the more privacy is achieved; however, convergence performance degrades as the noise increases. Hence, there is always a trade-off between convergence performance and the degree of privacy. The training quality is also affected by the size and distribution of the data. These are the major limitations of the approach, which led us to consider an alternative technique. GS can be a better option because it does not add any noise to the data and therefore does not affect the convergence performance of the algorithm.
Communication/computation-efficient SA for FL [20] (2021): This algorithm adopts the same four steps as the conventional scheme of SA: (1) advertising keys, (2) sharing keys, (3) mask input collection, and (4) unmasking. However, the key is shared with a subset of the clients (instead of all clients). This is more efficient than the conventional protocol because the involvement of fewer clients reduces the number of communication messages, and the overall computational cost is reduced using a smaller number of masks. However, this algorithm still requires client-to-client communication, making it less efficient when the network size is increased.
Breaking the quadratic aggregation barrier in secure FL [21] (2021): This approach uses secure multi-party computation based on cryptographic primitives, including secret sharing. Although the algorithm ensures privacy, it also suffers from high communication burdens when reconstructing secret information distributed over multiple clients.
Learning rate optimization for FL exploiting over-the-air computations [22] (2021): A dynamic learning rate is proposed for efficient data aggregation in over-the-air computation. The authors use parallel computing with multiple-input single-output and multiple-input multiple-output systems, together with ML, to reduce the overall computation time. However, they consider only averaging and focus on minimizing the aggregation error; client-to-server communication and privacy are not discussed in detail.
Decentralized federated averaging [23] (2015): The authors implement FedAvg in a decentralized manner because they assume that attacking the central server would break the privacy of the entire system. They implement the model on clients connected by an undirected graph. To further reduce the communication costs, they consider quantized decentralized FedAvg with momentum. However, decentralization makes the algorithm less secure because the devices must trust one another.
FedV: Privacy-preserving FL over vertically partitioned data [24] (2021): To avoid existing client-to-client communications in horizontal FL, the authors proposed a vertical FL that used functional encryption to secure their data. This reduced the training time, and no issues arose when clients could not connect.
To the best of our knowledge, this is the first algorithm that protects both client identity and data privacy. The server does not know which update belongs to which client, and at the same time it does not know the preceding sequence of updates behind the current update. Hence, the server can predict neither the data nor the owner of the data, which mitigates the chances of inference attacks. Our algorithm effectively reduces the computational cost and network bandwidth while providing the necessary security features.
Section 5 presents a detailed comparison between the proposed algorithm and existing methods.
3. GSFedSec: GS-Based SA for Privacy Preservation in FL
Our GS-based protocol fundamentally solves the high computational cost of the traditional SA by using the GS, as shown in Figure 2. Each client signs its data with the GS and sends it to the server, without the need to exchange unmasking keys with the server and other clients. The server can check whether a given client belongs to a valid group, although it cannot identify individual members. Thus, the GS allows a client device to be validated at a lower cost without revealing its identity. Furthermore, if a group member exhibits suspicious behavior, it can be removed from the group, and the GS is then updated for the remaining group members. The admin maintains a membership table and a revocation table. Whenever a member joins the group, an entry is added to the membership table; whenever a member leaves or is revoked from the group, an entry is added to the revocation table. A member revoked by the admin because of malicious behavior is not allowed to rejoin. In this way, clients are encouraged to behave well in the group; otherwise, they risk being revoked.
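A minimal sketch of this bookkeeping (our illustration; the class and method names are hypothetical, and key handling is omitted) keeps a membership table, a revocation table, and a signature version that is refreshed whenever the membership changes:

class GroupAdmin:
    """Toy membership/revocation bookkeeping (illustrative; keys omitted)."""
    def __init__(self):
        self.members = {}      # client_id -> signature key material
        self.revoked = {}      # client_id -> reason for leaving/revocation
        self.gs_version = 0    # bumped whenever membership changes

    def join(self, client_id, key_material):
        if self.revoked.get(client_id) == "malicious":
            raise PermissionError("revoked for malicious behavior; cannot rejoin")
        self.members[client_id] = key_material
        self.gs_version += 1          # group signature must be refreshed

    def revoke(self, client_id, reason="left"):
        self.members.pop(client_id, None)
        self.revoked[client_id] = reason
        self.gs_version += 1          # remaining members receive updated GS keys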
In addition, an adversary cannot target a particular model update because all are signed with the GS and secured with secret keys. Therefore, the GS-based technique offers several advantages over the traditional SA process, as follows:
GS is less costly than SA because the number of transmissions is reduced by removing the requirement for client-to-client communication.
In our algorithm, clients do not need to know each other because they do not need to communicate with each other. They only need to obtain an updated signature from the admin, and this cost is significantly smaller than that of sending messages to many nodes in the network.
The proposed protocol is improved from the short-GS algorithm presented in [25]. Group members transmit data using their GS, hiding their identities. When the server receives the data, it sees the GS, verifies that it comes from a valid member, and extracts the message. The server still does not know the exact identity of the sender; hence, it cannot attribute specific data to a specific sender. Subsequently, it aggregates all collected data and computes the new model. Therefore, we designed a new integrated protocol that simultaneously and efficiently ensures data and source privacy.
3.1. System Setup
Let $G_1$ and $G_2$ be two multiplicative cyclic groups of prime order $p$ with $g_1$ and $g_2$ as their generators, respectively. These groups are isomorphic. Suppose that $e: G_1 \times G_2 \rightarrow G_T$ is a bilinear map, where $G_T$ is also a multiplicative cyclic group of prime order $p$, and the principles of bilinearity and non-degeneracy are adhered to:
Bilinearity: $e(u^{a}, v^{b}) = e(u, v)^{ab}$ for all $u \in G_1$, $v \in G_2$, and all $a, b \in \mathbb{Z}_p$.
Non-degeneracy: $e(g_1, g_2) \neq 1$.
Furthermore, we assume that our protocol obeys the following assumptions:
Strong Diffie–Hellman Assumption: For unknown $x$ and $y$, the adversary has no advantage in learning $g^{xy}$ if $g$, $g^{x}$, and $g^{y}$ are given (where $g$ is the generator of the cyclic multiplicative group).
Decisional Diffie–Hellman (DDH) Problem: For unknown $x$, $y$, and $z$, and given $g^{x}$, $g^{y}$, and $g^{z}$, decide whether $z = xy$. This problem is not hard and can be solved in polynomial time.
Bilinear Diffie–Hellman Assumption: For unknown $x$, $y$, and $z$, and given $g^{x}$, $g^{y}$, and $g^{z}$, the adversary has no advantage in learning $e(g, g)^{xyz}$.
Decision Linear (DLIN) Assumption: Given $u$, $v$, $h$, $u^{x}$, $v^{y}$, and $h^{z}$, where $x$, $y$, and $z$ are random exponents, an adversary cannot distinguish between the case $z = x + y$ and a random $z$ in polynomial time. The DLIN assumption is useful where the DDH assumption does not hold, especially in pairing-based cryptography.
Linear Encryption: This is an application of the DLIN assumption in which the public key is a triple of generators $u$, $v$, and $h \in G_1$, and the private key is a pair of exponents $x_1, x_2 \in \mathbb{Z}_p$ such that $u^{x_1} = v^{x_2} = h$ [26]. For encryption, we choose $a$ and $b$ randomly from $\mathbb{Z}_p$. The encrypted ciphertext set is $(u^{a}, v^{b}, M \cdot h^{a+b})$. This can be denoted as $(T_1, T_2, T_3)$. To recover the message from the encryption, the user computes $M = T_3 / (T_1^{x_1} \cdot T_2^{x_2})$.
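The following toy Python sketch (our illustration) mirrors this linear encryption in a small prime-order subgroup of the integers modulo a safe prime; the actual scheme operates in pairing-friendly groups, and the parameters here are chosen only for readability. The same decryption step is what allows the holder of $(x_1, x_2)$ to recover the hidden group element, which is also the idea behind the admin's tracing in Section 3.1.3.

import secrets

p = 2039            # toy safe prime, p = 2q + 1
q = 1019            # prime order of the subgroup generated by h
h = 4               # generator of the order-q subgroup (assumed public)

# Private key: exponents x1, x2; public key: (u, v, h) with u^x1 = v^x2 = h.
x1, x2 = secrets.randbelow(q - 1) + 1, secrets.randbelow(q - 1) + 1
u = pow(h, pow(x1, -1, q), p)       # u = h^(1/x1), so u^x1 = h
v = pow(h, pow(x2, -1, q), p)       # v = h^(1/x2), so v^x2 = h

def encrypt(M):
    a, b = secrets.randbelow(q), secrets.randbelow(q)
    return pow(u, a, p), pow(v, b, p), (M * pow(h, a + b, p)) % p

def decrypt(T1, T2, T3):
    mask = (pow(T1, x1, p) * pow(T2, x2, p)) % p   # recovers h^(a+b)
    return (T3 * pow(mask, -1, p)) % p

M = 1234 % p                        # message encoded as a group element (toy)
assert decrypt(*encrypt(M)) == M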
Our protocol is designed for dynamic client management. The FL server can choose any clients it wants to participate. Client selection is dynamic, which means that for each iteration, the server can choose different clients from a group. Threshold-based cryptography can be used to maintain the dynamic participation of the clients in a group. The group keys are generated based on the number of clients in a group and a threshold value $t$ [27]. The admin generates the group key and a separate private key for each client $i$ for all group members. When any member joins or leaves the group, the group keys and threshold need to be modified accordingly. The admin distributes the keys to all members whenever there is a change in the group keys.
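As one concrete way to realize such a threshold $t$ (an illustrative choice on our part; the paper cites threshold-based cryptography [27] without fixing a particular scheme here), the admin could split a group secret using Shamir secret sharing so that any $t$ of the $n$ clients suffice to reconstruct it:

import secrets

PRIME = 2**61 - 1   # toy prime field for the shares

def make_shares(secret: int, t: int, n: int):
    """Split `secret` into n Shamir shares; any t of them reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the toy prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789      # any 3 shares suffice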
The messages in basic SA are padded with parts of secret keys, which cancel out after aggregation. The time graph of this scheme is shown in Figure 3. Our algorithm adopts this feature but adds a group signature to avoid the large computation costs, as shown in Figure 4. The step-by-step algorithm is defined as follows:
3.1.1. Key Generation
We adopt the short GS described in [25] for key generation. The admin is the trusted entity, and it is responsible for two types of key generation: the signature keys and the secret key. A signature key is generated for every member of the group. We have one isomorphic bilinear group pair $(G_1, G_2)$ with the pair member generators $g_1$ and $g_2$, respectively, and the property $\psi(g_2) = g_1$. The admin selects $h \in G_1$, $\xi_1, \xi_2 \in \mathbb{Z}_p^*$, and $\gamma \in \mathbb{Z}_p^*$; thus, it computes $u, v \in G_1$ such that $u^{\xi_1} = v^{\xi_2} = h$ and $w = g_2^{\gamma}$. Here, $w$ is the public key of the admin, and $(\xi_1, \xi_2)$ is the private key tuple, which is used to trace the signer in the case of any dispute. $\gamma$ is the private key of the admin. Using this $\gamma$, a tuple $(A_i, x_i)$ for the $i$-th device in the group can be computed as follows: $A_i = g_1^{1/(\gamma + x_i)}$, where $x_i$ is selected randomly from $\mathbb{Z}_p^*$. $(A_i, x_i)$ is the tuple for the corresponding $i$-th device. Then, the admin publishes the group public key as $gpk = (g_1, g_2, h, u, v, w)$.
After the server sends the admin a list of the clients selected for FL, the admin generates a secret key and divides it into n parts, where n is the number of clients selected for FL. The division is such that when all the parts are combined, they cancel out. Then, the admin sends one part of the secret key, along with the signature key tuple $(A_i, x_i)$, to each respective client $i$.
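A minimal sketch of this splitting step (our illustration; the exact encoding of the secret key parts is not specified in this section) draws n − 1 random shares and sets the last one so that all shares sum to zero modulo a public modulus, which is what makes the masks vanish once the server adds up all masked updates:

import secrets

MOD = 2**32      # toy public modulus for the additive masks

def split_zero_sum(n: int):
    """Admin-side: n random shares s_1..s_n with (s_1 + ... + s_n) % MOD == 0."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((-sum(shares)) % MOD)
    return shares

shares = split_zero_sum(5)
assert sum(shares) % MOD == 0    # the parts cancel out when combined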
3.1.2. Local Optimization
All client devices that want to participate in FL send a request to the server. The server chooses only a few clients for efficiency and asks them to send local updates. The clients selected for FL train on their local data using the global model and optimize it before sending the update to the server. Note that raw data are never sent to the server; the data are always trained locally, so the transmitted updates differ considerably from the original data. The data are trained using the formula given in Equation (1):
$$ w^{k} \leftarrow w^{k} - \eta \nabla F_k(w^{k}) \qquad (1) $$
where $\eta$ is the fixed learning rate, $k$ denotes the $k$-th client, $w^{k}$ is the model parameter, and $F_k$ is the local objective function of client $k$.
After conducting local training on the locally collected data and optimizing it, the client adds a part of the secret key in the update to keep it secure from the server’s access.
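A compact numpy sketch of this client-side step (our illustration; the least-squares objective, learning rate, and fixed-point encoding are assumptions made for the example) performs one local gradient step in the spirit of Equation (1) and then adds the client's secret-key part before transmission:

import numpy as np

eta = 0.1                       # fixed learning rate (assumed value)
SCALE, MOD = 10**6, 2**32       # fixed-point encoding used for masking (assumed)

def local_objective_grad(w, X, y):
    """Gradient of a toy least-squares objective F_k(w) = ||Xw - y||^2 / (2m)."""
    return X.T @ (X @ w - y) / len(y)

def client_update(w_global, X, y, secret_share):
    w = w_global.copy()
    w -= eta * local_objective_grad(w, X, y)             # Equation (1): one step
    w_int = np.round(w * SCALE).astype(np.int64) % MOD   # encode for masking
    return (w_int + secret_share) % MOD                  # add the admin's secret part

# toy usage
rng = np.random.default_rng(1)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
masked = client_update(np.zeros(3), X, y, secret_share=rng.integers(0, MOD, 3))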
3.1.3. Signature Generation
After the edge device receives its key tuple $(A_i, x_i)$ and the group public key from the admin, it can send messages by computing its GS. To formulate its GS, it must first compute the tracing, binding, and signing variables. To compute the tracing variables, the device selects random exponents $\alpha, \beta \in \mathbb{Z}_p$ and sets the tracing variables $(T_1, T_2, T_3)$ and two helper values ($\delta_1$, $\delta_2$), as given in Equations (2a) and (2b):
These tracing variables are used to detect the actual signer of the message. This tracing can be performed by the admin using the formula given in Equation (3): the value recovered by Equation (3) has a corresponding entry (the identity of the client) in the membership table stored by the admin. Whenever required, the admin extracts this identity from the table and identifies the actual signer. If a client does anything unwanted or shows malicious behavior, it can be removed by the group admin, and the group keys are updated accordingly [26]. Adding the traceability feature to the group signature thus keeps the clients in check; hence, it is important to include tracing variables in the signature so that the signer remains traceable by the admin.
Subsequently, the device chooses random binding values $r_\alpha$, $r_\beta$, $r_x$, $r_{\delta_1}$, and $r_{\delta_2}$; from these, the signature components are computed as follows:
Compute the binding variables (Equations (4a)–(4c)).
Compute the challenger (Equation (5)).
Compute the signing variables (Equations (6a) and (6b)).
Therefore, the GS is given by Equation (7).
3.1.4. Encryption
After signing the message, the sender device encrypts it with the server’s public key so that no device in the network can read the message while it is in transit. The sender then merges it with the challenger into a single message and forwards it to the server. Let the server’s public key pair be given. Message $M$ can be encrypted as shown in Equation (8). The client then adds the secret key part received from the admin, as shown in Equation (9). Subsequently, the resulting pair (the encrypted, masked update and its group signature) is sent to the server.
3.1.5. Verification
The server identifies a member by its group signature. If a member is registered as a group member and its membership is valid, it signs with the latest group signature. The server can verify the latest group signature keys with the admin, as shown in step 4 of Figure 4. If the client’s signature does not match the latest group signature, it is rejected by the server. In this way, the server verifies the legitimacy of the client; however, the server learns only the client’s group from the signature, not the actual signer. The server computes the challenger from the group keys received from the admin and from the ciphertext and signing variables received from the client. The client also sends the hashed challenger. If the two challengers, the one sent by the client and the one computed by the server, do not match, the signature is rejected.
When the server receives a message from an edge device, it first decrypts it, obtains the message, and verifies the signature. To verify the signature, the server first collects the latest GS keys from the admin. Then, it computes the verification variables given in Equations (10a)–(10c) and recomputes the challenger $c$ as in Equation (11). If the recomputed challenger matches the one received from the client, the signature is verified; otherwise, the server rejects the received message. In this way, the server learns the message $M$ but not the signer of the message. The server receives the model updates from all participating members in this manner.
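The final comparison is a Fiat–Shamir-style check: the verifier rebuilds the challenge hash from the public values and the transmitted signature components and compares it with the challenger received from the client. The sketch below (our illustration; the signature layout and hash-input ordering are assumptions, and the recomputation of the verification variables of Equations (10a)–(10c) is abstracted into a caller-supplied helper) shows only this comparison step:

import hashlib

def challenge(message: bytes, *group_elements: bytes) -> int:
    """Hash the message together with the (re)computed commitment values."""
    h = hashlib.sha256(message)
    for g in group_elements:
        h.update(g)
    return int.from_bytes(h.digest(), "big")

def verify(message: bytes, signature: dict, recompute_commitments) -> bool:
    """signature = {'c': int, 'T': [...], 's': [...]} (hypothetical layout).
    recompute_commitments(signature) must return the verification variables of
    Equations (10a)-(10c) as bytes; it is abstracted away here."""
    R_values = recompute_commitments(signature)
    c_check = challenge(message, *signature["T"], *R_values)
    return c_check == signature["c"]          # accept only if challengers match

# toy usage with a dummy recomputation helper (for illustration only)
sig = {"c": 0, "T": [b"T1", b"T2", b"T3"], "s": [b"s1", b"s2"]}
sig["c"] = challenge(b"model-update", *sig["T"], b"R1", b"R2", b"R3")
assert verify(b"model-update", sig, lambda s: [b"R1", b"R2", b"R3"])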
3.1.6. Secure Aggregation
After receiving updates from the participating devices, the server collects all of them. These updates are padded with the secret key parts. To cancel the masks collected from all clients, the server needs an unmasking key, which the admin sends along with the latest GS. The server then aggregates all updates to obtain the new updated model. The formula for SA can be expressed as given in Equation (12), where $\eta$ is the fixed learning rate, $K$ is the number of clients, $C$ is the fraction of clients participating in the process, $w$ is the model parameter, $w_t$ is the current model, and $w_{t+1}$ is the subsequent model. We define $n = \sum_k n_k$, where $n$ is the total number of samples in the dataset and $n_k$ is the number of samples at client $k$. Algorithm 1 outlines the procedures at the server, the clients, and the admin of the proposed algorithm.
Algorithm 1 GSFedSec.

A. Server executes:
1:  initialize
2:  for each iteration round do
3:      select the set of n clients
4:      send the client list to the admin
5:      for each client do
6:          receive the client's message via ClientUpdate()
7:          decrypt the message
8:          verify the signature: Equations (10) and (11)
9:      end for
10:     aggregate the verified updates: Equation (12)
11: end for

B. ClientUpdate():
12: receive the current global model
13: split the local data into batches of size B
14: for each client do
15:     for each batch do
16:         compute the local update: Equation (1)
17:         call ClientSign()
18:     end for
19: end for
20: return the encrypted, signed update to the server

C. ClientSign():
21: compute tracing variables: Equations (2a) and (2b)
22: compute binding variables: Equations (4a)–(4c)
23: compute challenger: Equation (5)
24: compute signing variables: Equations (6a) and (6b)
25: generate group signature: Equation (7)
26: encrypt model update: Equation (8)
27: add secret key: Equation (9)
28: return to ClientUpdate()

D. AdminGenerates():
29: generate the secret key
30: divide the secret key into n parts
31: for each client i from 1 to m do   // m = total number of clients
32:     generate the signature key tuple for client i
33:     for each i from 1 to n do   // n = number of clients selected for FL
34:         send the secret key part and signature keys to client i
35:     end for
36: end for
37: return
|