1 Introduction
Our digital life is composed of receiving, processing, communicating, and producing information. We tend to organize digital activities around tasks that are contextualized by entities, such as apps, documents, people, and various keywords. These entities are semantic data objects that have properties corresponding to the real-world objects they represent [28] and specify the context of our activities. We accomplish our digital tasks through a set of interactions with entities on our digital devices, which then trigger a series of screen transitions following each interaction. The problem faced by users engaged in digital tasks is how to allocate their limited cognitive resources to find and access the required entities from a wide range of data [51]. Digital life is characterized by multitasking and frequent interruptions, requiring frequent activity switches. For instance, consider a user engaging in multiple tasks every day; for each task, she works on different documents, opens different applications, browses the Web with task-specific keywords, and communicates with colleagues about the task. During a day, the user is therefore in different states, working on multiple tasks that are associated with sets of entities. Furthermore, many of these interactions are repetitive—we visit the same websites, send messages and reply to e-mails, and open previously visited documents. With the growing number of entities, search has become increasingly important for finding information on personal computers. This retrieval process can be time consuming and cognitively challenging, as the entity to be retrieved (e.g., a file, the name of a person, or an address on the Web) may be difficult to recall. As a result, information overload can pose an additional challenge to the progression of digital tasks: individuals have to remember which entities are associated with each task to restore the information related to that task when needed. Consequently, designing personal assistants and contextual recommendation systems that can understand the different states of the user and support the management of tasks has gained increasing interest [12, 35, 45, 55, 71, 75].
Over the past few decades, the use of assistive technologies and tools has changed how information work is carried out. Many diverse user interfaces and interaction techniques have been developed to facilitate access to previously used items and to support task management [12, 26, 28, 47]. Examples include Web page recency lists [31, 46], lists of previously opened documents [66, 73], and personal information management systems [7, 11, 49]. By exposing users to a larger variety of items that may be of interest to them, these tools help users find what they are looking for quickly and efficiently. For example, in cloud-based platforms such as Google Drive and Microsoft Office 365, recommendations are intended to facilitate access to the documents users are likely to need in the near future, thereby eliminating the burden of memorizing folder structures and automating document management [74]. However, the majority of proposed personal assistants are based on heuristic methods that consider recency and frequency without modeling the user's activities.
Screen recording of digital devices (e.g., laptops, tablets, and smartphones) can provide a wealth of information about the ongoing digital tasks of users, and consequently about their states. Manual maintenance of such information collections and manual analysis of data acquired from screen recordings are time consuming and not feasible for long-term studies involving weeks or months. To capture users' ongoing states, we need to develop a representation of their activities. The use of activity mining to automate the maintenance of such collections appears to be a promising alternative [55]. Activity mining extracts distinct activities from a stream of interactions with entities by utilizing interaction histories. The existing literature on activity mining demonstrates varying degrees of success in limited laboratory study setups. Despite previous efforts, most approaches fail to address two important aspects, namely (a) considering rich cross-application entities and (b) modeling temporal behavior.
Despite a large body of work devoted to modeling user digital behavior, most advances have focused on predetermined interaction logs (e.g., only query logs, e-mail, or Web browsing history) [40, 68, 76], or data acquisition has been limited to a certain application or predefined tasks [30, 36, 43, 75]. In particular, the context of the user task is mainly determined based on the user's Web activity, such as recent Web queries issued by the user [15, 42] or the blog post or Web document the user is composing [5, 17, 25, 35]. However, there are many other sources of contextual information that can be useful in determining the user state, such as 24/7 digital behavioral recordings that are not restricted to a specific application or a type of user input. Previous approaches have not been effective in utilizing the rich features present on the user's screen during real-life digital activities, nor in considering the complex co-occurrences between the different types of entities appearing on the screen, which this article aims to address. Furthermore, users' interests are dynamic, constantly changing, and influenced by their previous behavior. Identifying users' dynamic preferences based on their historical behavior can be challenging but is essential for personal recommendation systems. Some previous studies fail to capture the sequential development of contexts over time, or model only linear dynamics of user representations, which are insufficient to capture the nonlinear dynamics of human behavior [12, 28, 35, 45, 71].
There has been fairly little research on approaches that automatically learn user states and accordingly predict the user's needs in real-life digital activities. In this article, we present entity footprinting, an approach for entity recommendation in realistic everyday digital tasks, based on a user model learned from images captured from the screen. To collect the user's 24/7 digital behavioral recordings, we employed a screen monitoring approach that captured all user interaction data and the generated visual content (i.e., the visual content presented to the user on the screen) across application boundaries. A user state model is built from heterogeneous data with multiple (temporal and topical) aspects, contextualized by several entities such as applications, documents, people, and various keywords. The model is then utilized to predict the subsequent user state and the entities relevant to that state. To this end, we aim to answer two main research questions. Our first question concerns identifying the user state:
RQ1: Can we automatically identify and distinguish users’ states from their everyday digital activities?
Beyond the state identification, we are particularly interested in predicting which entities the user will be interested in working on next, which leads to our second question:
RQ2: Does user state prediction help in recommending more relevant entities to users in the context of their daily digital activities?
The model that solves the aforementioned issues should satisfy the following four characteristics: (1) it should follow an unsupervised learning approach (i.e., the model needs no prior knowledge about the categories of activities or the labeled data); (2) due to the high dimensionality and sparsity of the extracted data from users’ screens, and therefore the high computational cost of the data processing, the data must be clustered into meaningful clusters that take textual content into account and represent different states of a user; (3) it should take into account the time-varying nature of human behavior; and (4) to recommend entities to users, the predicted states must be converted into a ranking over entities. To answer those research questions and fulfill these characteristics, we present a novel approach for data-driven modeling of users’ state in their daily digital activities. This model is able to predict the entities that the user is likely to find relevant given the user’s interaction history.
Due to the high dimensionality and sparsity of digital behavioral data, with several thousands of entities occurring in the entire recording history, the user state is modeled using a topic modeling approach wherein a topic represents a user state. Statistical co-occurrence patterns among entities justify the application of a topic model to identify the underlying latent thematic structure of the data. However, this model on its own disregards the order of interactions and does not take the temporal behavior of the user into account. Therefore, in this work, to address this problem and to capture the sequential signals underlying users' behavior sequences, we use the Bidirectional Long Short-Term Memory (BiLSTM) model [54] to identify the sequential relatedness of states. Moreover, we employ the self-attention mechanism [10, 69] to learn a better representation of the user's state in the behavior sequence by leveraging sequential information, so as to accurately predict the user's subsequent state and accordingly recommend the most relevant entities related to the predicted state. A diagram of the proposed model is shown in Figure 1.
To evaluate the model, we conducted an offline analysis of real-world digital activity data in which all information appearing on the screens of 13 users during a period of 14 days was captured automatically via screen monitoring and converted to text using Optical Character Recognition (OCR) [28].
The main contributions of this work can be summarized as follows:
— A new representation for characterizing digital activities: the entity footprint across application boundaries, which utilizes contexts acquired from a monitoring system to capture the user state.
— A user model capable of predicting the user state in digital life based on entity footprinting and of predicting the needed entities at the right time.
— An empirical evaluation showing how the proposed user model improves prediction performance compared to baseline models.
The article is structured as follows. In the following section, a discussion of related work is provided, and in Section 3, we introduce the data acquisition approach in entity footprinting and the user interface implemented in this work. In Section 4, we present the user model overview and problem formulation, including user state modeling, sequence modeling, and entity recommendation. Dataset creation and experimental exploration are presented in Section 5. We evaluate our proposed method and compare it with the baselines in Section 6. Finally, we present a discussion and conclusion in Sections 7 and 8, respectively.
4 User Model in Entity Footprinting: Problem Formulation
The user model in this work captures the user's interactions with the computer, which consist of information objects acquired from the user's screen. It then models the user's contexts, uses this model to infer the user's state at each timestep, learns the preferences of the user, and finally provides the relevant entities to the user. An illustrative example of a sequence of information objects recorded from a user's screen and the corresponding predicted entities is shown in Figure 3. We consider that each interaction with the computer can be recorded as a tuple of an information object and the time at which the interaction occurred. Each information object itself consists of a collection of entities, including the title of the document, the application to which the document belongs, and the screen content (keywords and persons). The ith interaction, \(\Omega _i\), in a sequence can be expressed as follows:
\[\Omega _i = (o_i, t_i),\]
where \(t_i\) indicates the time when the interaction with a particular information object \(o_i\) occurred. Each \(o_i\) is
\[o_i = (s_i, a_i, w_i, p_i),\]
where \(s_i\) is the title of the active screen, \(a_i\) identifies the app to which the information object belongs, and \(w_i\) and \(p_i\) refer to the screen content, containing a collection of entities comprising keywords and people names, respectively. The sequence of user digital activities may be viewed as a series of these tuples.
Based on the bag-of-words model, each information object \(o_i\) can be represented by a bag of individual entities \([\epsilon _1, \ldots ,\epsilon _{|E|}]^T\) in which the nonzero elements are the entities present in the current information object. \(E\) is the set of all unique entities, including screen titles, app names, keywords, and people names extracted from the entire recording history, and \(|E|\) denotes the set's size. The logged digital activities of the user are stored in the matrix \(X \in \mathcal {R}^{|E|\times N}\) shown in Figure 4, where the columns are a sequence of information objects \(o_i\) and the rows are entities extracted from the user's screen. The \((i,j)\)th element is 1 if the ith entity exists in the jth information object.
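To make the construction of X concrete, the following is a minimal sketch (not the authors' implementation) of turning a stream of information objects into a sparse binary entity-object matrix; the dictionary keys "title", "app", "keywords", and "people" are illustrative assumptions about how an information object might be stored.

```python
# Minimal sketch (not the paper's implementation): build the binary
# entity-object matrix X from a stream of information objects.
from scipy.sparse import lil_matrix

def build_entity_matrix(information_objects):
    # Collect the set E of unique entities over the whole recording history.
    vocab = {}
    bags = []
    for o in information_objects:
        entities = [o["title"], o["app"]] + list(o["keywords"]) + list(o["people"])
        for e in entities:
            vocab.setdefault(e, len(vocab))
        bags.append(entities)

    # X has |E| rows (entities) and N columns (information objects);
    # X[i, j] = 1 if the i-th entity appears in the j-th information object.
    X = lil_matrix((len(vocab), len(bags)), dtype=int)
    for j, entities in enumerate(bags):
        for e in entities:
            X[vocab[e], j] = 1
    return X.tocsr(), vocab
```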
Given a sequence of previous information objects, we are interested in predicting which entities are likely to appear next, at the \((n+1)\)th step:
\[\hat{o}_{n+1} = f(\Omega _1, \Omega _2, \ldots , \Omega _n).\]
To achieve the function f, we utilize a machine learning model. Because \(|E|\) is large (it can reach several thousands in our dataset), implementing machine learning models directly on these highly sparse, high-dimensional vectors would require enormous amounts of data. Therefore, we first cluster the stream of information objects into different states by leveraging the textual content and exploiting the co-occurrence patterns among entities. In this way, we reduce the size of the dataset by formulating a semantic and meaningful representation of the collected entities. The second step in achieving the function f is to encode the history of a user by modeling a sequence of interactions. Finally, based on this encoding of the history, we design the model so that it recommends to the user the entities that are most likely to appear in the next timestep.
4.1 User State Modeling
Identifying the states of the user from a sequence of interactions with the computer can be seen as the task of clustering sequential data. By creating representations of a user's state, we can determine what tasks the user is focused on at any given time instant. To generate more advanced representations, which consider an entity's relevance within the context of the user's overall state, we focus on topic modeling approaches. These representations can be used to automatically match a topic to each state of the user and thereby identify the task they are working on. The most extensively used topic model for clustering data is the Latent Dirichlet Allocation (LDA) model, in which a finite number of topics is defined in advance [3]. Its extension, the hierarchical Dirichlet process, is the nonparametric counterpart of LDA and allows an unbounded number of topics [65]. These algorithms are designed to discover hidden thematic structures in a collection of documents and rely on the co-occurrence of words to make cluster inferences. In these methods, the probability of a word being assigned to a particular topic is determined by the word count of that topic.
In recent years, researchers have also been exploring the idea of clustering document streams into clusters based on the temporal sequence in which they arrive [13, 24]. These models do not require a fixed-size dataset; instead, they can be applied to a stream of documents arriving sequentially, with the number of clusters updated automatically. In this work, to deal with a continuous stream of screens, we implement a Dirichlet-Hawkes Process (DHP), a probabilistic generative model that combines the strengths of Bayesian nonparametrics and the Hawkes process [13]. The DHP is a continuous-time model for streaming data that allows for self-excitation. The key idea in the DHP is that the Hawkes process (a kind of temporal point process) models the rate intensity of information objects, whereas the Dirichlet process captures the cluster membership of information objects, with each cluster representing a state that groups the information objects related to it.
Let \(z_{1:n}\) denote the latent state indicators. Given a stream of information objects \(\lbrace (o_i,t_i)\rbrace _{i=1}^n\), the inference algorithm in the DHP is composed of two subroutines: it first samples the latent cluster for the current information object \(o_n\) by sequential Monte Carlo and then updates the learned triggering kernels of the corresponding cluster. The DHP generates a series of samples \(\theta _{1:n}^o\) corresponding to these information objects, and each state has a distinctive value of \(\theta _i^o\). If there are K distinct values \(\theta _{1:K}\) at time \(t_n\), then \(z_n \in \lbrace 1, 2, \dots , K, K+1\rbrace\), where \(z_n = K+1\) denotes a new state and \(0\lt z_n \le K\) denotes an existing state. Let the uniform prior \(\theta _0\) be an \(|E|\)-dimensional vector (where \(|E|\) denotes the size of the unique entity set) in which every element is a constant value. The posterior is decomposed as \(P(z_n|o_n, t_n,\text{rest}) \propto P(o_n|z_n,\text{rest})\, P(z_n|t_n,\text{rest})\) by the Dirichlet-multinomial conjugacy. The likelihood \(P(o_n|z_n,\text{rest})\) is then given by
\[P(o_n|z_n,\text{rest}) = \frac{\Gamma (C^{z_n} + |E|\theta _0)}{\Gamma (C^{z_n} + C^{o_n} + |E|\theta _0)} \prod _{\nu =1}^{|E|} \frac{\Gamma (C_{\nu }^{z_n} + C_{\nu }^{o_n} + \theta _0)}{\Gamma (C_{\nu }^{z_n} + \theta _0)}.\]
Here, \(C^{z_n}\) is the entity count of cluster (state) \(z_n\), \(C^{o_n}\) is the total entity count of information object \(o_n\), and \(C_{\nu }^{z_n}\) and \(C_{\nu }^{o_n}\) are the corresponding counts of the \(\nu\)th entity. Finally, \(P(z_n|t_n, \text{rest})\) is the prior given by the DHP as
\[P(z_n = k|t_n, \text{rest}) \propto {\left\lbrace \begin{array}{ll} \lambda _{\theta _k}(t_n), & 0 \lt k \le K,\\ \lambda _0, & k = K+1, \end{array}\right.}\]
where \(\lambda _0\) is the base intensity of a background Poisson process, \(\lambda _{\theta _k}\) is the intensity of the Hawkes process corresponding to the kth state, and \(\gamma _{\theta _i^o}(t_n, t_i) = \exp (-|t_n - t_i|)\) is its triggering kernel. Using these probabilities, sequential Monte Carlo sampling is used to infer the state label of each information object.
This model is able to learn a representation of each observed input at each timestep and provides an appropriate framework for generating a representation of the user state from observations of digital activity. Additionally, as the digital activity of the user arrives in a streaming fashion and carries time information, we can leverage both the content and the time information to better cluster the activities, or what we call states, of the user. However, this model still disregards the order of states and does not take into account the sequential information and recurrent activities of the user.
4.2 User State Prediction
The state representation explained in the previous subsection aims to cluster what is on the user's screen at each time frame. We can also compress what happens over time. Our focus is on a frequently encountered question: can we predict the kind of activity a user will undertake in the future based on the sequence of activities observed in the past? How do past states affect the occurrence of future states? To correctly understand user preferences, one must be able to account for information about sequential behaviors and the inherent dynamics in the behavior. Therefore, the second component of our model is sequence learning on the user state. This module processes the sequence of inputs and predicts the most likely future continuation of the sequence, that is, the state that the user is expected to reach. By modeling the sequences, we can learn the digital activity patterns of the users. As an example, the occurrence of one event related to checking Twitter may result in a series of events about other social media such as Facebook. Generally, when a user is working, information objects that appear in close proximity to one another tend to share a similar topic. This implies that the appearance of a specific topic is likely to be followed by the emergence of related topics in a nearby time frame.
One typical approach to model temporal dynamics in user behavior is to use a latent autoregressive model. This algorithm updates the latent state using \(h_{n+1} = f(h_n,o_n)\), and the observable state is derived from \(o_{n+1} = g(h_{n+1},o_n)\) for some data \(o_n\). The functions f and g are nonlinear functions that can be learned from data and are commonly referred to as Recurrent Neural Networks (RNNs) in deep learning. One of the most widely used variants of the RNN is the LSTM, which contains specially designed units to avoid vanishing gradients.
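As a minimal illustration of this latent autoregressive view, the sketch below uses arbitrarily chosen tanh/linear parameterizations to stand in for the learned functions f and g; the weight matrices are hypothetical placeholders, not the model used in this work.

```python
import numpy as np

def latent_autoregressive_step(h_n, o_n, W_h, W_o, V_h, V_o):
    # h_{n+1} = f(h_n, o_n): nonlinear update of the latent state.
    h_next = np.tanh(W_h @ h_n + W_o @ o_n)
    # o_{n+1} = g(h_{n+1}, o_n): prediction of the next observation.
    o_next = V_h @ h_next + V_o @ o_n
    return h_next, o_next
```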
Our technique is based on the idea of seeing the state as a nonlinear function of the state's history and parameterizing it using an RNN. In this model, the user state history can be encoded into a compact vector representation, from which the subsequent state of the user can be predicted. Encoding the user interaction history into a compact vector (representing the user's preferences) can be done using the basic paradigm of the left-to-right sequential model. Despite their popularity and efficacy, such unidirectional left-to-right models are insufficient for learning appropriate representations of user behavior sequences. These models were initially developed for types of sequential data that have a natural order, such as text and time series data; therefore, encoding is done only on data from previous items. However, users' behaviors in real-world applications may not always follow this rigidly ordered sequence [27, 62, 72]. When modeling user behavior sequences, we can consider context from both directions. LSTMs with bidirectional properties can learn input sequences both forward and backward, with the two interpretations being concatenated and embedded within the hidden state. Our intuition behind using the BiLSTM neural network is to use all available information and effectively model the local dependencies between states of the user in a temporal manner.
Formally, given a state \(z_n\) at timestep n, the corresponding hidden state \(h_n\) can be derived using the equations defining the various gates of the LSTM as follows:
\[\begin{aligned} i_n &= \sigma (W_i z_n + U_i h_{n-1} + b_i),\\ f_n &= \sigma (W_f z_n + U_f h_{n-1} + b_f),\\ o_n &= \sigma (W_o z_n + U_o h_{n-1} + b_o),\\ c_n &= f_n \odot c_{n-1} + i_n \odot \tanh (W_c z_n + U_c h_{n-1} + b_c),\\ h_n &= o_n \odot \tanh (c_n), \end{aligned}\]
where \(c_n\) denotes the cell state. To capture long-term dependencies, the LSTM cell adds an internal gating mechanism: i, \(f,\) and o are the input, forget, and output gates, respectively, in Equation (7). These gates control how information is added to or removed from the cell state along the sequence of state updates. \(z_n\) and \(h_n\) are the one-hot vector of the input state and the LSTM hidden state at timestep n, respectively.
We divide a sequence of user states \(\lbrace z_1,z_2, \ldots ,z_n\rbrace\) into fixed-size sliding windows of size W for \(n = 1, \dots , N\), where each window is formed as \(\lbrace z_{n-W+1}, \ldots ,z_{n-1},z_{n}\rbrace\). Given the last W user states in this window, the LSTM network performs the following:
\[\overrightarrow{h}_n = \overrightarrow{\text{LSTM}}(z_n, \overrightarrow{h}_{n-1}), \qquad \overleftarrow{h}_n = \overleftarrow{\text{LSTM}}(z_n, \overleftarrow{h}_{n+1}), \qquad h_n = [\overrightarrow{h}_n; \overleftarrow{h}_n].\]
The forward layer output sequence, \(\overrightarrow{h}\), is iteratively calculated using the inputs in a positive sequence from time \(N-W\) to time \(N-1\), whereas the backward layer output sequence, \(\overleftarrow{h}\), is calculated using the reversed inputs from time \(N-W\) to time \(N-1\). The desired output, the prediction of the next topic, is then produced at each timestep from \(h_n\):
\[\hat{z}_{n+1} = g(h_n),\]
where g is an arbitrary differentiable function followed by a softmax. The BiLSTM network accepts a sequence of z (states) as input and outputs the next z.
Although the BiLSTM utilizes the user's sequential behavior to capture long-term dependencies in the contextual user state, this approach cannot focus on the important information and the user's main purpose within the obtained contextual state. In real-life digital activity, there are situations in which a user is working on a specific topic but accidentally opens a document or clicks on a wrong link that opens an irrelevant Web page. Although these actions are part of the user's behavior sequence, they are not the primary focus of the user at that time. As a result, it is crucial to consider the main goal of the user in each session in addition to the sequential behavior. By concentrating on the important aspects of the contextual state, we can boost the accuracy of our prediction. The attention mechanism can highlight important information by assigning different weights, and a BiLSTM combined with the attention mechanism can enhance the prediction accuracy even further [41].
To train the model using back-propagation, the \(\text{loss} (\hat{z}_{n+1}, z_{n+1})\) is measured using categorical cross entropy. The trained network can then serve as a model to predict the future state in the test dataset. The output of the network depends not only on the latest state but also on a sequence of states.
4.3 Entity Recommendation
The topic model described in the first subsection provides the probability of each entity in each state. The sequence model then predicts, after the final softmax, the probability of each state at the next timestep. Given these probabilities at timestep n, the probability of a given entity \(\epsilon\), assuming Z states, is computed by
\[P(\epsilon) = \sum _{k=1}^{Z} P(\epsilon \mid z = k)\, P(z_{n+1} = k).\]
The top k entities are generated by sorting entities in descending order of this probability. In other words, the entities of each type (apps, documents, people, and keywords) that are most consistent with the future state are retrieved.
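The ranking step can be summarized by the following sketch, which marginalizes the per-state entity probabilities from the topic model over the predicted state distribution from the sequence model; the matrix shapes are assumptions consistent with the notation above, not the exact data layout of the system.

```python
import numpy as np

def recommend_entities(entity_given_state, state_probs, k=10):
    """entity_given_state: |E| x Z matrix with P(entity | state) from the topic model.
    state_probs: length-Z vector with P(z_{n+1} = k) from the sequence model."""
    # P(entity) = sum_k P(entity | z = k) * P(z_{n+1} = k)
    entity_probs = entity_given_state @ state_probs
    # Indices of the top-k entities in descending order of probability.
    return np.argsort(-entity_probs)[:k]
```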
5 Experimental Study
To investigate the research questions, we collected data from 13 users as they accomplished their daily digital tasks. Five males and eight females with an average age of 25 years were recruited to take part in the study. Participants with higher educational backgrounds were chosen, as they were likely to use their personal laptops for work-related tasks, allowing us to collect more realistic data. Upon joining the study, participants were informed of their privacy rights and told that their data would be encrypted, stored on a secure server, and used only for research purposes. As compensation for participating, they received 120 euros.
The research was carried out in accordance with the ethical guidelines of the University of Helsinki. Regarding the data usage policy and procedure, participants were asked to complete a consent form. The research plan and informed consent form were approved by the Ethical Committee of the University of Helsinki. It is important to note that all logs are stored locally, the logging tool does not upload any data to the cloud, and all evaluation scripts utilizing these logs were run locally on the computers of participants.
The monitoring system was installed on participants' laptops, and digital activities were continuously recorded in a background thread for 14 days. The system was set to launch automatically whenever the laptop was turned on. Participants could stop the system at any time; however, we advised them to avoid doing so during the monitoring period unless it was necessary.
5.1 Data Description
The data were preprocessed into a standardized format consisting of a stream of information objects, each comprising the merged screenshots of documents with the same window title and a set of entities: an application name, a document title, and keywords and non-keyword terms from OCR-processed text units. We used frame-difference methods to exclude duplicate keywords and terms constantly appearing on the screen and focused only on the information change. Due to the large number of occurrences associated with various browsers, we extracted the domain names of the Web pages visited and considered them to be separate applications.
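The frame-difference step described above could, for example, be approximated by a simple set difference between consecutive OCR outputs of the same window, as in the sketch below; this is an illustrative approximation rather than the exact preprocessing pipeline used in the study.

```python
def frame_difference(frames):
    """frames: chronologically ordered lists of OCR terms for screenshots
    sharing the same window title. Keeps only newly appearing terms."""
    previous = set()
    changes = []
    for terms in frames:
        current = set(terms)
        changes.append(current - previous)   # keep the information change only
        previous = current
    return changes
```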
5.2 Data Analysis
Table 1 summarizes the data collected during the 2-week digital activity monitoring of the 13 participants. The average number of recorded information objects per participant was 2,903 (\(SD=1{,}388\)), which corresponded to an average of 78 hours (\(SD=73\)) of computer usage per participant. The average numbers of unique documents and unique applications accessed per participant were 811 (\(SD=326\)) and 140 (\(SD=52\)), respectively. An average of 241 (\(SD=208\)) people entities were found in the data. Keyword extraction from OCR-processed text units resulted in 35,400 (\(SD=16{,}611\)) keywords and 17,534 (\(SD=6{,}859\)) non-keyword terms per participant.
5.3 Training Details
Hyperparameters for the DHP include \(\lambda _0 = 0.05\) and \(\gamma _0 = 0.1\). For inference, we used sequential Monte Carlo sampling with eight particles. For the BiLSTM, we used a sequence length of \(W=10\). The BiLSTM network was modeled using two layers with 64 neurons in each layer. The network parameters were learned using a mini-batch stochastic gradient descent algorithm, with the batch size set to 32, the dropout rate set to 0.5, and the learning rate initialized to 0.001. Categorical cross entropy was used as the loss function, and the loss on the validation set was used as the criterion for early stopping of the training. We split each user's data into training and test sets: \(80\%\) of the data was selected for training and the remaining \(20\%\) was used as the test set. The evaluation objective for the prediction experiments is, given \(80\%\) of the data for training, to assess the predictive quality for the individual user states and entities issued during the remaining \(20\%\) of the data. We also sampled \(20\%\) of the training set as a hold-out validation set. BiLSTM models were trained for 1,000 epochs (i.e., 1,000 iterations over the entire training set) and then evaluated against the validation set. The model parameters with the best performance on the validation set were selected and then evaluated on the test set. The BiLSTM networks were implemented using the TensorFlow library and trained on a machine equipped with GPUs.
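For reference, the following sketch shows one way the described architecture could be assembled in TensorFlow/Keras with the stated hyperparameters (two bidirectional layers of 64 units, dropout 0.5, learning rate 0.001, batch size 32, early stopping on the validation loss). The placement of dropout inside the LSTM layers and the omission of the attention layer are simplifying assumptions; the exact layer composition used in this work may differ.

```python
import tensorflow as tf

def build_bilstm(num_states, window=10, units=64, dropout=0.5, lr=0.001):
    # Input: a window of W one-hot encoded user states; output: distribution over the next state.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, num_states)),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True, dropout=dropout)),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, dropout=dropout)),
        tf.keras.layers.Dense(num_states, activation="softmax"),
    ])
    # Mini-batch SGD with categorical cross entropy (sparse variant for integer labels).
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy")
    return model

# Illustrative training call with early stopping on the validation loss:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32, epochs=1000,
#           callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss",
#                                                       restore_best_weights=True)])
```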
7 Discussion
The main contribution of this article is to introduce entity footprinting for predicting user information needs using contextual information available on the screen of the user. This system collects an individual's entity footprints from personal digital devices and automatically extracts a state representation that can be used to learn semantic relationships among the collected information and to find related entities such as documents, apps, people, and keywords. A user study provides evidence that our system is able to proactively produce relevant resources and is suited for re-finding previously seen entities.
It is important to note that entity footprinting differs significantly from other recommendation tasks, such as movies, songs, and shopping items. In this research, users have a lot of information about the entities with which they have previously interacted, and they have a clear objective with regard to finding or relocating specific entities (e.g., documents, apps, people) when they work with their digital devices. Therefore, a successful recommendation system in the setting of entity footprinting requires an accurate recommendation algorithm. A more accurate recommendation indicates that the proposed method can reduce users’ manual search effort by providing them with more relevant and useful information.
Furthermore, the entity footprinting presented in this article differs from other personal information management systems in three ways. First, our approach is proactive, in the sense that it does not require any action from users and instead exploits context from users' screens and past interactions to predict users' needs in the future and provide them with information relevant to their predicted tasks. In addition, users' everyday digital activities are heterogeneous, meaning that they are not limited to a specific application, and users can switch across several applications. Therefore, the second difference is that entity footprinting is principally based on screen recordings, making it a general system that is agnostic to the tasks users perform or the applications they use for their tasks. In this work, we did not conduct any experimental study in controlled lab settings; instead, we examined in-the-wild data collection and real-world tasks. By using a single data source (users' screens), we were able to create a rich user model without requiring any human supervision. Entity footprinting was examined mainly on datasets acquired from knowledge workers; however, it is not limited to knowledge work and can be applied to other types of computer users as well. Participants in our user study were engaged in different types of tasks ranging from writing a thesis and coding to checking social media, online shopping, and reading news. Within these tasks, participants took part in different activities, and their intents and preferences changed frequently. Due to this drift in intents and preferences for entities over time, entity footprinting should be time sensitive. Therefore, the third difference is that our proposed model explicitly takes into account the temporal behavior of the user.
The proposed system can augment the human with a digital memory of the entities they have interacted with. This memory can be used in different applications building on personal digital data. Among these applications are (1) proactive search, which provides users with information based on their past behavior rather than requiring explicit queries, learning their interests and search preferences from their history; (2) time line search, which aids in recalling events and searching for specific information by displaying the information on a graphical time line; and (3) associative recall, which deals with specific relationships between entities. It is possible that we remember some partial information, but not the exact information we are looking for; entity footprinting provides cues that can facilitate associative recall.
Results on state prediction accuracy showed that the proposed method is able to capture the users’ rapidly evolving preferences and consequently provide them with entities that are actually used in their tasks. Our findings provide evidence that considering contextual as well as temporal information can help the entity footprinting system identify significantly more relevant entities than other baselines when performing real-world digital tasks. Nevertheless, utilizing a digital activity monitoring method is not without limitations. Here, we acknowledge the limitations of our study and outline potential research areas for future studies.
Artifact Access. Some information, such as Web bookmarks, may always be visible on the active windows regardless of the task at hand. The model may be confused by this information.
Experiment Limitations. Our findings were based on a 13-person experiment that lasted for 2 weeks. To ensure that our findings will be valid for a broader population, a larger experiment over a longer period of time will be necessary. Although our observations have provided us with valuable insights, the possibility of improving the prediction accuracy could have been enhanced by longer sessions and more data.
Generalization. It is not feasible to generalize from one person’s collection of personal information objects to another because of the abundance of specialized tasks, keywords, and entities used. Therefore, our model should operate at an individual level by processing data from each user’s device, without relying on collective patterns across multiple users.
Privacy. The monitoring system introduced in this work may contain sensitive information. However, this is a common issue with most personal assistant systems. Some participants disabled the monitoring temporarily during some activities. More study of the concealed data could assist in automating the process of setting the privacy boundaries that users expect.
Influence on User Behavior. An evaluation of performance could be conducted, rather than focusing on relevance, to quantify the usefulness and impact that comprehensive entity footprinting can have on users’ daily digital activities.
7.1 Future Work
There are several future directions for this work. One could improve the model by incorporating other temporal features, such as the duration and timing of events and activities; here, the Hawkes process can play a crucial role in creating more accurate models. There is important information to be gained by analyzing the precise interval between two events to understand the dynamics of the underlying behavior. The characteristics of these data establish a fundamental difference from independent and identically distributed time series data, where time is viewed as an index rather than as a randomly distributed variable. A second direction is to further investigate the dependence between user states and their transitions. Most user states discovered by our models generally correspond to repetitive human tasks. By recognizing which routine a user is probably engaged in, a collection of related entities can be recommended.
The data stored in entity footprinting are in textual form. Even files that contain images, videos, or voice recordings are stored and retrieved by their file names in textual form. It is, however, possible to convert these types of data into textual data (e.g., speech-to-text, visual concept detection) and augment them with the extracted textual information.
Furthermore, there is no comparison with other models of temporal dynamics (e.g., plain RNNs, transformers). Using an RNN at the entity level may result in higher accuracy in predicting entities, as it can take advantage of entity-level statistics. However, the aim of this work was to investigate the possibility of modeling user states via digital activity monitoring. Therefore, comparing other advanced models, such as transformer-based models, with the proposed method is an interesting area of future work.
8 Conclusion
Despite the fact that entity recommendation systems are becoming a common feature of personal assistants and commercial platforms, research in this area has been limited to specific applications or predefined tasks. Our focus in this article was to study the applicability of entity footprinting in everyday digital life. We investigated how well the proposed approach is able to understand users' states while they perform their everyday digital tasks using heterogeneous applications. This was enabled by a digital activity monitoring system, which allowed context extraction across application boundaries. By automatically predicting and presenting relevant information in advance, users can easily access the information without having to formulate specific queries. The proposed model (1) is unsupervised and does not need any knowledge about the categories of activities or tasks, (2) clusters the high-dimensional digital activity data into meaningful states, (3) considers the time-varying nature of human behavior through sequence and attention models, and (4) represents the predicted states as a ranking over entities in order to recommend the top-ranked entities.
To validate our approach, we implemented it in the introduced entity footprinting system and conducted a user study with a realistic dataset. We investigated the impact of the dynamic evolution of the user's state on finding more relevant entities. In an earlier study based on the EntityBot system, the user model relied on linear modeling to address the challenges of high dimensionality, limited explicit interaction, and real-time requirements for interactive use. In this work, in contrast, we developed a more expressive model using the DHP topic model and a BiLSTM network to learn the user states and their dynamic behavior. We presented a prediction framework for the user state that incorporates two factors influencing users' daily digital behavior: context and user preferences. We evaluated this framework with a 2-week, 13-subject field trial and compared it to heuristic, static, and other baselines.