3.1.1 Overall Framework.
The architecture of the proposed FedCTR method is shown in Figure 2. In FedCTR, user behaviors on multiple online platforms are used to infer user interest for native ad CTR prediction, but due to privacy concerns this behavior data cannot leave the platforms where it is stored. To solve this problem, FedCTR introduces a user server to coordinate the different platforms, and the core idea is to communicate intermediate model results and their gradients between the server and the platforms rather than exchanging the raw user behavior data. We assume that the different platforms have aligned their user IDs via privacy-preserving entity resolution techniques [11, 32]. Concretely, there are three major modules in the FedCTR framework. The first module is the ad platform, which predicts the CTR scores of a set of native ads using a CTR predictor. It computes the probability score of a target user \(u\) clicking a candidate ad \(d\) based on their representations \(\mathbf{u}\) and \(\mathbf{d}\), which is formulated as \(\hat{y}=f_{CTR}(\mathbf{u},\mathbf{d}; \Theta_C)\), where \(\Theta_C\) represents the parameters of the CTR predictor. The representation of the ad \(d\) is computed by an ad model based on its ID and text, which is formulated as \(\mathbf{d}=f_{ad}(ID, text; \Theta_D)\), where \(\Theta_D\) denotes the parameters of the ad model. The second module consists of \(K\) behavior platforms. Each platform has a user model that learns local user representations from its stored local user behaviors, such as search queries on the search engine platform and browsed webpages on the web browsing platform. For the \(i\)-th behavior platform, the learning of the local user representation is formulated as \(\mathbf{u}_i=f_{user}^i(behaviors; \Theta_{U_i})\), where \(\Theta_{U_i}\) denotes the parameters of the user model maintained by this platform. The third module is a user server, which is responsible for coordinating the behavior platforms to learn local user embeddings according to the query of the ad platform and for aggregating them into a unified user representation \(\mathbf{u}\), which is formulated as \(\mathbf{u}=f_{agg}(\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K; \Theta_A)\), where \(\Theta_A\) denotes the aggregator model parameters. The aggregated user embedding is then sent to the ad platform. Note that the user server is a third party that is independent of the behavior and ad platforms. We summarize the characteristics of the different participants in FedCTR in Table 1. Next, we introduce each module in detail.
In the ad platform, assume there is a set of candidate ads, denoted as \(\mathcal{D}=[d_1, d_2,\ldots, d_P]\). Each ad has an ID, a title, and a description, and the ad model in the ad platform learns ad representations from these three fields. When a user \(u\) visits a website where native ads are displayed, the ad platform is called to compute personalized CTR scores of the candidate ads for this user. It sends the ID of this user to the user server to query her embedding, which is inferred from her behaviors on multiple platforms and encodes her personalized interest information. When the ad platform receives the user embedding \(\mathbf{u}\) from the user server, it uses a CTR predictor to compute the ranking scores of the candidate ads based on \(\mathbf{u}\) and the embeddings of the candidate ads \([\mathbf{d}_1, \mathbf{d}_2,\ldots, \mathbf{d}_P]\) using \(f_{CTR}(\cdot)\); these scores are denoted as \([\hat{y}_1, \hat{y}_2,\ldots, \hat{y}_P]\).
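To make the overall workflow concrete, the following minimal Python sketch traces the inference-time message flow. All class and method names (BehaviorPlatform, UserServer, AdPlatform, EMB_DIM) are our own illustrative choices rather than the authors' code, random vectors stand in for the neural models, and the noise-addition steps described below are omitted here:

```python
import numpy as np

EMB_DIM = 64  # assumed embedding dimensionality

class BehaviorPlatform:
    def local_user_embedding(self, user_id: str) -> np.ndarray:
        # u_i = f_user^i(behaviors; Theta_Ui); a random vector stands in
        # for the neural user model, and LDP noise is omitted here
        return np.random.randn(EMB_DIM)

class UserServer:
    def __init__(self, platforms):
        self.platforms = platforms

    def user_embedding(self, user_id: str) -> np.ndarray:
        # query every behavior platform, then aggregate; a plain mean
        # stands in for the attention aggregator f_agg
        local = [p.local_user_embedding(user_id) for p in self.platforms]
        return np.mean(local, axis=0)

class AdPlatform:
    def __init__(self, server: UserServer):
        self.server = server

    def ctr_scores(self, user_id: str, ad_embs: np.ndarray) -> np.ndarray:
        u = self.server.user_embedding(user_id)    # query by user ID only
        return 1.0 / (1.0 + np.exp(-ad_embs @ u))  # dot-product f_CTR

# Rank P = 10 candidate ads for one user across K = 3 behavior platforms.
server = UserServer([BehaviorPlatform() for _ in range(3)])
scores = AdPlatform(server).ctr_scores("u42", np.random.randn(10, EMB_DIM))
```

Note that the ad platform only ever sees a user ID and an aggregated embedding; raw behaviors never cross platform boundaries.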
The user server is responsible for user embedding generation by coordinating the multiple user behavior platforms. When it receives a user embedding query from the ad platform, it uses the user ID to query the \(K\) behavior platforms, which learn local user embeddings from their local behaviors. After it receives the local user embeddings \([\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K]\) from the different behavior platforms, it uses an aggregator model with the function \(f_{agg}(\cdot)\) to aggregate the \(K\) local user embeddings into a unified one, \(\mathbf{u}\), taking the relative importance of the different kinds of behaviors into consideration. Since the unified user embedding \(\mathbf{u}\) may still contain some private information about user behaviors, to further enhance privacy protection we apply the differential privacy (DP) technique [7] to \(\mathbf{u}\) by adding Laplacian noise with strength \(\lambda_{DP}\). The user server then sends the perturbed user embedding \(\mathbf{u}\) to the ad platform for personalized CTR prediction.
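As a concrete illustration of this perturbation step, the sketch below adds zero-mean Laplacian noise to an embedding. Mapping the strength \(\lambda_{DP}\) to the scale parameter of the Laplace distribution is our assumption, since the text does not pin down the exact parameterization:

```python
import numpy as np

def perturb(u: np.ndarray, strength: float) -> np.ndarray:
    """Add zero-mean Laplacian noise to an embedding; `strength` plays the
    role of lambda_DP on the user server (and lambda_LDP on a behavior
    platform). Treating it as the Laplace scale is our assumption."""
    return u + np.random.laplace(loc=0.0, scale=strength, size=u.shape)

u = np.random.randn(64)             # aggregated user embedding (placeholder)
u_tilde = perturb(u, strength=0.1)  # what the server actually sends onward
```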
The behavior platforms are responsible for learning user embeddings from their local behaviors. When a behavior platform receives the embedding query for user \(u\), it retrieves the behaviors of this user on this platform (e.g., search queries posted to the search engine platform), which are denoted as \([b_1, b_2,\ldots, b_M]\), where \(M\) is the number of behaviors. Then, it uses a neural user model to learn the local user embedding \(\mathbf{u}_i\) from these behaviors; \(\mathbf{u}_i\) captures the user-interest information encoded in the behaviors. Since the local user embedding may also contain some private information about the user behaviors on the \(i\)-th behavior platform, we apply the local differential privacy (LDP) technique [34] by adding Laplacian noise with strength \(\lambda_{LDP}\) to each local user embedding to better protect user privacy. The behavior platform then uploads the perturbed local user embedding \(\mathbf{u}_i\) to the user server for aggregation.
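The platform-side flow can be sketched as follows, reusing the perturb helper from the sketch above; local_user_model and the behavior list are hypothetical stand-ins:

```python
import numpy as np

def local_user_model(behaviors) -> np.ndarray:
    """Stand-in for f_user^i; a real platform runs its neural user model."""
    return np.random.default_rng(0).standard_normal(64)

behaviors = ["query one", "query two"]   # hypothetical local behaviors
u_i = local_user_model(behaviors)        # local user embedding u_i
u_i_tilde = perturb(u_i, strength=0.2)   # lambda_LDP = 0.2 (assumed value)
# Only u_i_tilde is uploaded to the user server; `behaviors` never leave.
```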
Next, we discuss the privacy protection offered by the proposed FedCTR method. First, in FedCTR the raw user behavior data never leaves the behavior platforms where it is stored; only the user embeddings learned from behaviors by the neural user models are uploaded to the user server. According to the data processing inequality [26], the private information conveyed by these local user embeddings is usually much less than that conveyed by the raw user behaviors. Thus, user privacy can be effectively protected. Second, the user server aggregates the local user embeddings from different platforms into a unified one and sends it to the ad platform, making it very difficult for the ad platform to infer a specific user behavior on a specific platform from the aggregated embedding. Third, we apply the LDP technique to the local user embeddings on each behavior platform and the DP technique to the aggregated user embedding on the user server, in both cases by adding Laplacian noise for perturbation, which makes it even more difficult to infer the raw user behaviors from the local or aggregated user embeddings. Thus, the proposed FedCTR method can protect user privacy well while utilizing user behaviors on different platforms to model user interest for CTR prediction.
3.1.2 Model Details.
In this section, we introduce the model details in the FedCTR framework, including the user model, ad model, aggregator, and CTR predictor.
User Model. The user model learns local user embeddings from the local user behaviors on a behavior platform. The user models on different behavior platforms share the same architecture but have different model parameters. The architecture of the user model is shown in Figure 3(a). It is based on the neural user model proposed in Reference [41], which learns user embeddings from user behaviors in a hierarchical way. It first learns behavior representations from the texts in behaviors, such as the search query in online search behaviors and the webpage title in webpage browsing behaviors. The behavior representation module first converts the text in a behavior into a sequence of word embeddings. Following Reference [6], we add a position embedding to each word embedding to model word order. Then the module uses a multi-head self-attention network [39] to learn contextual word representations by capturing the relatedness among the words in the text. Finally, it applies an attentive pooling network [43] to these contextual word representations, which computes the relative importance of the words and obtains a summarized text representation from the word representations and their attention weights.
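A minimal PyTorch sketch of this behavior representation module is given below; the hyperparameters (embedding size 64, 4 attention heads) and the attentive pooling parameterization (a tanh projection followed by a learned query vector) are our assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention network [43]: weight items and sum them into one vector."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.query = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):                        # x: (batch, seq, dim)
        alpha = torch.softmax(self.query(self.proj(x)), dim=1)
        return (alpha * x).sum(dim=1)            # (batch, dim)

class TextEncoder(nn.Module):
    """Behavior representation module: word + position embeddings,
    multi-head self-attention [39], then attentive pooling [43]."""
    def __init__(self, vocab: int, dim: int = 64, heads: int = 4,
                 max_len: int = 50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = AttentivePooling(dim)

    def forward(self, tokens):                   # tokens: (batch, seq)
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.word_emb(tokens) + self.pos_emb(pos)
        h, _ = self.attn(h, h, h)                # contextual word vectors
        return self.pool(h)                      # one vector per behavior
```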
After learning the representations of behaviors, a user representation learning module learns the user embedding from these behavior embeddings. First, we add a position embedding vector to each behavior embedding vector to capture the sequential order of the behaviors. Then, we apply a multi-head self-attention network to learn contextual behavior representations by capturing the relatedness among the behaviors. Finally, we use an attentive pooling network [43] to obtain a unified user embedding vector by summarizing these contextual behavior representations with their attention weights. The model parameters of the user model on the \(i\)-th behavior platform are denoted as \(\Theta_{U_i}\), and the learning of the local user embedding on this platform can be formulated as \(\mathbf{u}_i=f_{user}^i(behaviors; \Theta_{U_i})\).
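Building on the TextEncoder and AttentivePooling sketches above, a hypothetical hierarchical user model could look as follows; again, this is an illustration of the described architecture rather than the authors' implementation:

```python
class UserModel(nn.Module):
    """f_user^i: hierarchical encoder from a user's behaviors on one
    platform to the local user embedding u_i."""
    def __init__(self, vocab: int, dim: int = 64, heads: int = 4,
                 max_behaviors: int = 50):
        super().__init__()
        self.text_enc = TextEncoder(vocab, dim, heads)
        self.beh_pos_emb = nn.Embedding(max_behaviors, dim)
        self.beh_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = AttentivePooling(dim)

    def forward(self, behavior_tokens):          # (batch, M, seq)
        b, m, s = behavior_tokens.shape
        beh = self.text_enc(behavior_tokens.view(b * m, s)).view(b, m, -1)
        pos = torch.arange(m, device=behavior_tokens.device)
        h = beh + self.beh_pos_emb(pos)          # behavior order information
        h, _ = self.beh_attn(h, h, h)            # contextual behaviors
        return self.pool(h)                      # u_i: (batch, dim)

# Example: 2 users, M = 5 behaviors of 10 tokens each, vocabulary of 1000.
u_i = UserModel(vocab=1000)(torch.randint(0, 1000, (2, 5, 10)))  # (2, 64)
```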
Ad Model. The ad model learns embeddings of ads from their IDs, titles, and descriptions. Its architecture is illustrated in Figure 3(b). It is based on the ad encoder model proposed in Reference [1], with minor modifications. Similar to the user model, we use a combination of a word embedding layer, a multi-head self-attention layer, and an attentive pooling layer to learn title and description embeddings from their texts. In addition, we use an ID embedding layer and a dense layer to learn an ad representation from the ad ID. The final ad representation is learned from the ID embedding, title embedding, and description embedding via an attention network [1]. The model parameters of the ad model are denoted as \(\Theta_D\), and the learning of the ad embedding can be formulated as \(\mathbf{d}=f_{ad}(ID, text; \Theta_D)\).
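Reusing the components from the user-model sketch, an illustrative version of this ad model might look as follows; using attentive pooling to fuse the three views is our simplification of the attention network in Reference [1]:

```python
class AdModel(nn.Module):
    """f_ad: fuses ID, title, and description embeddings via attention."""
    def __init__(self, num_ads: int, vocab: int, dim: int = 64,
                 heads: int = 4):
        super().__init__()
        self.id_emb = nn.Embedding(num_ads, dim)
        self.id_dense = nn.Linear(dim, dim)      # dense layer on ID embedding
        self.title_enc = TextEncoder(vocab, dim, heads)
        self.desc_enc = TextEncoder(vocab, dim, heads)
        self.fuse = AttentivePooling(dim)

    def forward(self, ad_ids, title_tokens, desc_tokens):
        views = torch.stack([
            self.id_dense(self.id_emb(ad_ids)),  # (batch, dim)
            self.title_enc(title_tokens),        # (batch, dim)
            self.desc_enc(desc_tokens),          # (batch, dim)
        ], dim=1)                                # (batch, 3, dim)
        return self.fuse(views)                  # d: (batch, dim)
```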
Aggregator. The aggregator combines the local user embeddings learned on the different behavior platforms into a unified user embedding for CTR prediction. Since user behaviors on different platforms may differ in how informative they are for modeling user interest, we use an attention network [43] to evaluate the importance of each local user embedding when synthesizing them. It takes the local user embeddings \(\mathbf{U}=[\mathbf{u}_1, \mathbf{u}_2,\ldots, \mathbf{u}_K]\) from the \(K\) platforms as input and learns the aggregated user embedding \(\mathbf{u}\) via an attention network, which is formulated as \(\mathbf{u}=f_{agg}(\mathbf{U}; \Theta_A)=\sum_{i=1}^{K}\alpha_i\mathbf{u}_i\), where \(\alpha_i\) is the attention weight of the \(i\)-th local user embedding and \(\Theta_A\) denotes the parameters of the aggregator.
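Under the same assumptions as the sketches above, the aggregator reduces to attentive pooling over the \(K\) local embeddings:

```python
class Aggregator(nn.Module):
    """f_agg: attention over the K local user embeddings, i.e.,
    u = sum_i alpha_i * u_i with learned attention weights alpha_i."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.pool = AttentivePooling(dim)

    def forward(self, local_embs):               # (batch, K, dim)
        return self.pool(local_embs)             # u: (batch, dim)

# Example: aggregate local embeddings from K = 3 behavior platforms.
u = Aggregator()(torch.randn(2, 3, 64))          # (2, 64)
```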
CTR Predictor. The CTR predictor estimates the probability of a user \(u\) clicking a candidate ad \(d\) based on their representations \(\mathbf{u}\) and \(\mathbf{d}\), which is formulated as \(\hat{y} = f_{CTR}(\mathbf{u},\mathbf{d}; \Theta_C)\), where \(\Theta_C\) denotes the model parameters of the CTR predictor. There are many options for the CTR prediction function \(f_{CTR}(\cdot)\), such as the dot product [1], the outer product [12], and the factorization machine [10].
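For instance, a dot-product \(f_{CTR}(\cdot)\) can be sketched as follows; the shapes and the sigmoid squashing are illustrative assumptions:

```python
import torch

def f_ctr_dot(u: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Dot-product CTR predictor: y_hat = sigmoid(u . d)."""
    return torch.sigmoid((u * d).sum(dim=-1))

# Score P = 10 candidate ads for one user.
u = torch.randn(64)               # aggregated user embedding
D = torch.randn(10, 64)           # candidate ad embeddings [d_1, ..., d_P]
y_hat = f_ctr_dot(u, D)           # (10,) predicted CTR scores
```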