3.1.1 Overall Framework.
The architecture of the proposed FedCTR method is shown in Figure 2. In FedCTR, user behaviors on multiple online platforms are used to infer user interest for native ad CTR prediction, but due to privacy concerns this behavior data cannot leave the platforms where it is stored. To solve this problem, FedCTR introduces a user server to coordinate the different platforms, and the core idea is to communicate intermediate model results and their gradients between the server and the platforms rather than exchanging the raw user behavior data. We assume that the different platforms have aligned their user IDs via privacy-preserving entity resolution techniques [11, 32]. Concretely, there are three major modules in the FedCTR framework. The first module is the ad platform, which predicts the CTR scores of a set of native ads using a CTR predictor. It computes the probability score of a target user \(u\) clicking a candidate ad \(d\) based on their representations \(\mathbf{u}\) and \(\mathbf{d}\), which is formulated as \(\hat{y}=f_{CTR}(\mathbf{u},\mathbf{d}; \Theta_C)\), where \(\Theta_C\) represents the parameters of the CTR predictor. The representation of the ad \(d\) is computed by an ad model based on its ID and text, which is formulated as \(\mathbf{d}=f_{ad}(ID, text; \Theta_D)\), where \(\Theta_D\) denotes the parameters of the ad model. The second module consists of \(K\) behavior platforms. Each platform has a user model that learns local user representations from its stored local user behaviors, such as search queries on the search engine platform and browsed webpages on the web browsing platform. For the \(i\)-th behavior platform, the learning of the local user representation is formulated as \(\mathbf{u}_i=f_{user}^i(behaviors; \Theta_{U_i})\), where \(\Theta_{U_i}\) denotes the parameters of the user model maintained by this platform. The third module is a user server, which is responsible for coordinating the behavior platforms to learn local user embeddings according to the query of the ad platform and for aggregating them into a unified user representation \(\mathbf{u}\), which is formulated as \(\mathbf{u}=f_{agg}(\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K; \Theta_A)\), where \(\Theta_A\) denotes the aggregator model parameters. The aggregated user embedding is then sent to the ad platform. Note that the user server is a third party that is independent of the behavior and ad platforms. We summarize the characteristics of the different participants in FedCTR in Table 1. Next, we introduce each module in detail.
In the ad platform, assume there is a set of candidate ads, denoted as \(\mathcal{D}=[d_1, d_2,\ldots, d_P]\). Each ad has an ID, a title, and a description, and the ad model in the ad platform learns ad representations from these three fields. When a user \(u\) visits a website where native ads are displayed, the ad platform is called to compute personalized CTR scores of the candidate ads for this user. It sends the ID of this user to the user server to query her embedding, which is inferred from her behaviors on multiple platforms and encodes her personalized interest information. When the ad platform receives the user embedding \(\mathbf{u}\) from the user server, it uses a CTR predictor to compute the ranking scores of the candidate ads based on \(\mathbf{u}\) and the embeddings of the candidate ads \([\mathbf{d}_1, \mathbf{d}_2,\ldots, \mathbf{d}_P]\) using \(f_{CTR}(\cdot)\); these scores are denoted as \([\hat{y}_1, \hat{y}_2,\ldots, \hat{y}_P]\).
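To make the overall workflow concrete, the following minimal Python sketch traces the inference-time message flow. All class and method names (BehaviorPlatform, UserServer, AdPlatform, EMB_DIM) are our own illustrative choices rather than the authors' code, random vectors stand in for the neural models, and the noise-addition steps described below are omitted here:

```python
import numpy as np

EMB_DIM = 64  # assumed embedding dimensionality

class BehaviorPlatform:
    def local_user_embedding(self, user_id: str) -> np.ndarray:
        # u_i = f_user^i(behaviors; Theta_Ui); a random vector stands in
        # for the neural user model, and LDP noise is omitted here
        return np.random.randn(EMB_DIM)

class UserServer:
    def __init__(self, platforms):
        self.platforms = platforms

    def user_embedding(self, user_id: str) -> np.ndarray:
        # query every behavior platform, then aggregate; a plain mean
        # stands in for the attention aggregator f_agg
        local = [p.local_user_embedding(user_id) for p in self.platforms]
        return np.mean(local, axis=0)

class AdPlatform:
    def __init__(self, server: UserServer):
        self.server = server

    def ctr_scores(self, user_id: str, ad_embs: np.ndarray) -> np.ndarray:
        u = self.server.user_embedding(user_id)    # query by user ID only
        return 1.0 / (1.0 + np.exp(-ad_embs @ u))  # dot-product f_CTR

# Rank P = 10 candidate ads for one user across K = 3 behavior platforms.
server = UserServer([BehaviorPlatform() for _ in range(3)])
scores = AdPlatform(server).ctr_scores("u42", np.random.randn(10, EMB_DIM))
```

Note that the ad platform only ever sees a user ID and an aggregated embedding; raw behaviors never cross platform boundaries.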
The user server is responsible for user embedding generation by coordinating the multiple user behavior platforms. When it receives a user embedding query from the ad platform, it uses the user ID to query the \(K\) behavior platforms, which learn local user embeddings from their local behaviors. After it receives the local user embeddings \([\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K]\) from the different behavior platforms, it uses an aggregator model with the function \(f_{agg}(\cdot)\) to aggregate the \(K\) local user embeddings into a unified one, \(\mathbf{u}\), taking the relative importance of the different kinds of behaviors into consideration. Since the unified user embedding \(\mathbf{u}\) may still contain some private information about user behaviors, to further enhance privacy protection we apply the differential privacy (DP) technique [7] to \(\mathbf{u}\) by adding Laplacian noise with strength \(\lambda_{DP}\). The user server then sends the perturbed user embedding \(\mathbf{u}\) to the ad platform for personalized CTR prediction.
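As a concrete illustration of this perturbation step, the sketch below adds zero-mean Laplacian noise to an embedding. Mapping the strength \(\lambda_{DP}\) to the scale parameter of the Laplace distribution is our assumption, since the text does not pin down the exact parameterization:

```python
import numpy as np

def perturb(u: np.ndarray, strength: float) -> np.ndarray:
    """Add zero-mean Laplacian noise to an embedding; `strength` plays the
    role of lambda_DP on the user server (and lambda_LDP on a behavior
    platform). Treating it as the Laplace scale is our assumption."""
    return u + np.random.laplace(loc=0.0, scale=strength, size=u.shape)

u = np.random.randn(64)             # aggregated user embedding (placeholder)
u_tilde = perturb(u, strength=0.1)  # what the server actually sends onward
```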
The behavior platforms are responsible for learning user embeddings from their local behaviors. When a behavior platform receives the embedding query for user \(u\), it retrieves the behaviors of this user on this platform (e.g., search queries posted to the search engine platform), which are denoted as \([b_1, b_2,\ldots, b_M]\), where \(M\) is the number of behaviors. Then, it uses a neural user model to learn the local user embedding \(\mathbf{u}_i\) from these behaviors; \(\mathbf{u}_i\) captures the user-interest information encoded in the behaviors. Since the local user embedding may also contain some private information about the user behaviors on the \(i\)-th behavior platform, we apply the local differential privacy (LDP) technique [34] by adding Laplacian noise with strength \(\lambda_{LDP}\) to each local user embedding to better protect user privacy. The behavior platform then uploads the perturbed local user embedding \(\mathbf{u}_i\) to the user server for aggregation.
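The platform-side flow can be sketched as follows, reusing the perturb helper from the sketch above; local_user_model and the behavior list are hypothetical stand-ins:

```python
import numpy as np

def local_user_model(behaviors) -> np.ndarray:
    """Stand-in for f_user^i; a real platform runs its neural user model."""
    return np.random.default_rng(0).standard_normal(64)

behaviors = ["query one", "query two"]   # hypothetical local behaviors
u_i = local_user_model(behaviors)        # local user embedding u_i
u_i_tilde = perturb(u_i, strength=0.2)   # lambda_LDP = 0.2 (assumed value)
# Only u_i_tilde is uploaded to the user server; `behaviors` never leave.
```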
Next, we discuss the privacy protection offered by the proposed FedCTR method. First, in FedCTR the raw user behavior data never leaves the behavior platforms where it is stored; only the user embeddings learned from behaviors by the neural user models are uploaded to the user server. According to the data processing inequality [26], the private information conveyed by these local user embeddings is usually much less than that conveyed by the raw user behaviors. Thus, user privacy can be effectively protected. Second, the user server aggregates the local user embeddings from different platforms into a unified one and sends it to the ad platform, making it very difficult for the ad platform to infer a specific user behavior on a specific platform from the aggregated embedding. Third, we apply the LDP technique to the local user embeddings on each behavior platform and the DP technique to the aggregated user embedding on the user server, in both cases by adding Laplacian noise for perturbation, which makes it even more difficult to infer the raw user behaviors from the local or aggregated user embeddings. Thus, the proposed FedCTR method can protect user privacy well while utilizing user behaviors on different platforms to model user interest for CTR prediction.
3.1.2 Model Details.
In this section, we introduce the model details in the FedCTR framework, including the user model, ad model, aggregator, and CTR predictor.
User Model. The user model learns local user embeddings from the local user behaviors on a behavior platform. The user models on different behavior platforms share the same architecture but have different model parameters. The architecture of the user model is shown in Figure 3(a). It is based on the neural user model proposed in Reference [41], which learns user embeddings from user behaviors in a hierarchical way. It first learns behavior representations from the texts in behaviors, such as the search query in online search behaviors and the webpage title in webpage browsing behaviors. The behavior representation module first converts the text in a behavior into a sequence of word embeddings. Following Reference [6], we add a position embedding to each word embedding to model word order. Then the module uses a multi-head self-attention network [39] to learn contextual word representations by capturing the relatedness among the words in the text. Finally, it applies an attentive pooling network [43] to these contextual word representations, which computes the relative importance of the words and obtains a summarized text representation from the word representations and their attention weights.
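A minimal PyTorch sketch of this behavior representation module is given below; the hyperparameters (embedding size 64, 4 attention heads) and the attentive pooling parameterization (a tanh projection followed by a learned query vector) are our assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention network [43]: weight items and sum them into one vector."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.query = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):                        # x: (batch, seq, dim)
        alpha = torch.softmax(self.query(self.proj(x)), dim=1)
        return (alpha * x).sum(dim=1)            # (batch, dim)

class TextEncoder(nn.Module):
    """Behavior representation module: word + position embeddings,
    multi-head self-attention [39], then attentive pooling [43]."""
    def __init__(self, vocab: int, dim: int = 64, heads: int = 4,
                 max_len: int = 50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = AttentivePooling(dim)

    def forward(self, tokens):                   # tokens: (batch, seq)
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.word_emb(tokens) + self.pos_emb(pos)
        h, _ = self.attn(h, h, h)                # contextual word vectors
        return self.pool(h)                      # one vector per behavior
```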
After learning the representations of behaviors, a user representation learning module learns the user embedding from these behavior embeddings. First, we add a position embedding vector to each behavior embedding vector to capture the sequential order of the behaviors. Then, we apply a multi-head self-attention network to learn contextual behavior representations by capturing the relatedness among the behaviors. Finally, we use an attentive pooling network [43] to obtain a unified user embedding vector by summarizing these contextual behavior representations with their attention weights. The model parameters of the user model on the \(i\)-th behavior platform are denoted as \(\Theta_{U_i}\), and the learning of the local user embedding on this platform can be formulated as \(\mathbf{u}_i=f_{user}^i(behaviors; \Theta_{U_i})\).
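Building on the TextEncoder and AttentivePooling sketches above, a hypothetical hierarchical user model could look as follows; again, this is an illustration of the described architecture rather than the authors' implementation:

```python
class UserModel(nn.Module):
    """f_user^i: hierarchical encoder from a user's behaviors on one
    platform to the local user embedding u_i."""
    def __init__(self, vocab: int, dim: int = 64, heads: int = 4,
                 max_behaviors: int = 50):
        super().__init__()
        self.text_enc = TextEncoder(vocab, dim, heads)
        self.beh_pos_emb = nn.Embedding(max_behaviors, dim)
        self.beh_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = AttentivePooling(dim)

    def forward(self, behavior_tokens):          # (batch, M, seq)
        b, m, s = behavior_tokens.shape
        beh = self.text_enc(behavior_tokens.view(b * m, s)).view(b, m, -1)
        pos = torch.arange(m, device=behavior_tokens.device)
        h = beh + self.beh_pos_emb(pos)          # behavior order information
        h, _ = self.beh_attn(h, h, h)            # contextual behaviors
        return self.pool(h)                      # u_i: (batch, dim)

# Example: 2 users, M = 5 behaviors of 10 tokens each, vocabulary of 1000.
u_i = UserModel(vocab=1000)(torch.randint(0, 1000, (2, 5, 10)))  # (2, 64)
```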
Ad Model. The ad model learns embeddings of ads from their IDs, titles, and descriptions. Its architecture is illustrated in Figure 3(b). It is based on the ad encoder model proposed in Reference [1], with minor modifications. Similar to the user model, we use a combination of a word embedding layer, a multi-head self-attention layer, and an attentive pooling layer to learn title and description embeddings from their texts. In addition, we use an ID embedding layer and a dense layer to learn an ad representation from the ad ID. The final ad representation is learned from the ID embedding, title embedding, and description embedding via an attention network [1]. The model parameters of the ad model are denoted as \(\Theta_D\), and the learning of the ad embedding can be formulated as \(\mathbf{d}=f_{ad}(ID, text; \Theta_D)\).
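Reusing the components from the user-model sketch, an illustrative version of this ad model might look as follows; using attentive pooling to fuse the three views is our simplification of the attention network in Reference [1]:

```python
class AdModel(nn.Module):
    """f_ad: fuses ID, title, and description embeddings via attention."""
    def __init__(self, num_ads: int, vocab: int, dim: int = 64,
                 heads: int = 4):
        super().__init__()
        self.id_emb = nn.Embedding(num_ads, dim)
        self.id_dense = nn.Linear(dim, dim)      # dense layer on ID embedding
        self.title_enc = TextEncoder(vocab, dim, heads)
        self.desc_enc = TextEncoder(vocab, dim, heads)
        self.fuse = AttentivePooling(dim)

    def forward(self, ad_ids, title_tokens, desc_tokens):
        views = torch.stack([
            self.id_dense(self.id_emb(ad_ids)),  # (batch, dim)
            self.title_enc(title_tokens),        # (batch, dim)
            self.desc_enc(desc_tokens),          # (batch, dim)
        ], dim=1)                                # (batch, 3, dim)
        return self.fuse(views)                  # d: (batch, dim)
```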
Aggregator. The aggregator combines the local user embeddings learned on the different behavior platforms into a unified user embedding for CTR prediction. Since user behaviors on different platforms may differ in how informative they are for modeling user interest, we use an attention network [43] to evaluate the importance of each local user embedding when synthesizing them. It takes the local user embeddings \(\mathbf{U}=[\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K]\) from the \(K\) platforms as input and learns the aggregated user embedding \(\mathbf{u}\) via an attention network, which is formulated as \(\mathbf{u}=f_{agg}(\mathbf{U}; \Theta_A)=\sum_{i=1}^{K}\alpha_i\mathbf{u}_i\), where \(\alpha_i\) is the attention weight of the \(i\)-th local user embedding and \(\Theta_A\) denotes the parameters of the aggregator.
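Under the same assumptions as the sketches above, the aggregator reduces to attentive pooling over the \(K\) local embeddings:

```python
class Aggregator(nn.Module):
    """f_agg: attention over the K local user embeddings, i.e.,
    u = sum_i alpha_i * u_i with learned attention weights alpha_i."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.pool = AttentivePooling(dim)

    def forward(self, local_embs):               # (batch, K, dim)
        return self.pool(local_embs)             # u: (batch, dim)

# Example: aggregate local embeddings from K = 3 behavior platforms.
u = Aggregator()(torch.randn(2, 3, 64))          # (2, 64)
```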
CTR Predictor. The CTR predictor estimates the probability of a user \(u\) clicking a candidate ad \(d\) based on their representations \(\mathbf{u}\) and \(\mathbf{d}\), which is formulated as \(\hat{y} = f_{CTR}(\mathbf{u},\mathbf{d}; \Theta_C)\), where \(\Theta_C\) denotes the model parameters of the CTR predictor. There are many options for the CTR prediction function \(f_{CTR}(\cdot)\), such as the dot product [1], the outer product [12], and the factorization machine [10].
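For instance, a dot-product \(f_{CTR}(\cdot)\) can be sketched as follows; the shapes and the sigmoid squashing are illustrative assumptions:

```python
import torch

def f_ctr_dot(u: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Dot-product CTR predictor: y_hat = sigmoid(u . d)."""
    return torch.sigmoid((u * d).sum(dim=-1))

# Score P = 10 candidate ads for one user.
u = torch.randn(64)               # aggregated user embedding
D = torch.randn(10, 64)           # candidate ad embeddings [d_1, ..., d_P]
y_hat = f_ctr_dot(u, D)           # (10,) predicted CTR scores
```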