1 Introduction
Simulation models of human–computer interaction (HCI) are computer programs that produce moment-by-moment predictions of an interactive user’s action (e.g., [
30,
31,
49,
51,
54,
68,
74]). In technical terms, they consist of
model parameters and executable
mechanisms, which together produce simulations. The mechanisms represent the processes that underlie user behavior, whereas models’ parameters in general describe cognitive, physiological, or biomechanical characteristics of a simulated user. For example, parameters in a model of point-and-click behavior [
15] characterize the noise of the human visual and motor systems, and a model of menu-search behavior [
10] has parameters linked to memory recall. Such parameters are latent (not directly observable) and differ in values between individual users [
34] or user groups [
69]. Through finding the appropriate values, the model can accurately reflect behavioral differences between individuals.
In
inverse modeling, the goal is to infer model parameters – that is, computationally find the most plausible parameters that explain a given set of observed behaviors (see Figure
1(a)). In statistics terms, inverse modeling involves
parameter fitting. Solving the inverse modeling problem could enrich various applications in HCI. A model with well-inferred parameters can predict individual users’ behavior in the given task; therefore, related individual-level parameters often have been regarded as essential ingredients for personalizing user interfaces [
68,
69]. Inverse modeling informs not only developing explanations from observations but also devising new simulation models for HCI work. For example, it aids in verifying whether a particular posited mechanism correctly captures the phenomenon (e.g., matching empirical datasets) while the inferred parameters correspond to prior knowledge of user attributes or in identifying the most credible model from among several candidates [
34]. To exploit this potential more fully, inverse modeling often emphasizes finding the entire distribution for a parameter rather than just a single point value [
35].
However, parameter inference for the HCI field’s simulation models has proven prohibitively computation-intensive: inferring parameters from observations takes a very long time. The great complexity of today’s user simulation models – arising from their hierarchical structure, the need to optimize policies, etc. – rules out obtaining a probabilistic representation of user behavior for given inputs to the model. Statisticians refer to the likelihood function as intractable. Since these conditions preclude fast likelihood-based inference methods such as maximum likelihood estimation (MLE), we have had to resort to simulation-based inference techniques, such as approximate Bayesian computation (ABC) [
4]. Relying on repetitive simulations, they search possible ranges of parameters to estimate the values that best replicate given observations (see Figure
1(b)). These methods scale poorly, though [
11]. Researchers report that the parameter-search process requires immense computation time, from several hours [
18,
49] to days [
34]. For each new datum for inference (e.g., observed behavior of an added user), the entire parameter-search process must be repeated from scratch.
In light of this issue, the paper examines
amortized inference as a novel workflow that mitigates the computation-cost problem in HCI’s inverse modeling. Amortized inference is an emerging approach for machine learning designed to address the “likelihood-free inference problem” [
11,
23,
38,
58,
61,
72]. It uses
upfront computations to speed up future inference; i.e., the computation cost for inference is
amortized. This is accomplished through training a proxy model – specifically, a
density estimator – by means of deep neural networks (see Figure
1(c)). The estimator functions to estimate parameters’ posterior probability when given observed behavior and to represent the complex posterior distribution over the full range of these parameters. Training a density estimator requires a large synthetic dataset produced by running the model with various parametric inputs that could plausibly occur within the domain. The training phase entails high computation cost but takes place only once. Once trained, the density estimator can, in the second phase (inference), quickly and continually infer individual users’ model parameters (with the posterior distributions) from their behavior data.
We studied and developed amortized inference for the purpose of fitting simulation models to individual-level HCI data. Three motives lay behind this: an interest in gauging possible improvements in computation efficiency; our wish to study the prediction accuracy achievable, comparing the levels with those from non-amortization methods such as ABC [
34,
35]; and, assuming promising results, an aim of exploring amortized inference for scaling individual-level fitting to large HCI datasets. Our central goal was to see whether amortized inference could aid in estimating full distributions of cognitive parameters in a user population, something beyond the reach of previous methods.
We begin the discussion by presenting the novel workflow for applying amortized inference in HCI. Then, we report on our validation of it via three challenging cases that entailed fitting previously published simulation models to datasets from actual users. For each case, we trained a density estimator from only the model’s simulated data, then used it for inference on real user data. For Case 1 (
N=18), we fitted a reinforcement-learning-based user model of menu search [
10] to infer cognitive parameters (e.g., eye fixation duration) from aggregated user task-performance metrics (e.g., averaged completion time over whole trials). With Case 2 (
N=20), we fitted a point-and-click model [
15] to infer parameters of the visual and motor system. For this, we employed complex trajectory data (cursor trajectories from multiple trials) from an individual. Case 3 involved fitting a touchscreen-typing model [
30] to a large real-world dataset [
56] that had not been used for testing that model before. We inferred the distributions of model parameters (e.g., the fingers’ motor performance) for 1,057 individual users from keylog data in an online typing experiment. Also, we report the results from testing the model on a user task not encompassed by the training data, and we examine the effect of having a strong prior. The paper concludes with suggested practical applications of the low-cost inverse modeling approach, in light of our findings, and discussion of challenges that remain.
The paper offers three main contributions to inverse modeling in the HCI field:
(1)
Introduction of a workflow for applying amortized inference in modern user simulation models for HCI, with demonstrated ability to reduce computation costs significantly
(2)
Validation of the approach across three published case studies from CHI conferences, with comparison against known baselines and reports of efficiency and accuracy measurements
(3)
Open-source release of the entire code implementation from the study
for future research into amortized inference in HCI models
2 Related Work
Simulation models represent latent processes that are theorized to play out when a user interacts with a computer. This sets the approach apart from mathematical modeling, which is employed most often to predict aggregate outputs such as task-performance metrics [
17,
64,
67,
71,
78] and from data-driven models, which consider the dataset rather than the user [
28,
46,
50]. Simulations based on a cognitive-modeling architecture (GOMS [
8], ACT-R [
1], EPIC [
36], etc.) depict processes as a system of interconnected processing modules. Early work in this field was restricted by the need for task-specific manual description of the program being executed within the given architecture. Recent work aimed at avoiding the limitations of such hand-coded rule systems applies machine-learning (ML) methods: a
control policy is learned automatically. The policy captures a user’s tendency to act in a particular way given a particular (internal or external) state. This approach’s principles fall under the concept of
computational rationality, the notion that those latent processes are organized in service of maximizing expected value via action [
20,
44,
53,
59]. Researchers have directed RL and, more recently, deep RL toward estimating computationally rational policies, with applications demonstrated in menu search [
10], point-and-click [
15], touchscreen typing [
30], button pressing [
54], mid-air pointing [
9,
16], learning of layouts [
32], visual search [
33], driving [
31], and multitasking [
18].
The question addressed in this paper is how to determine those parameters in a simulation model that represent unobservable characteristics of the user. Prior simulation models have either used values reported in the literature or set the parameters via fitting to a dataset. With our study, we sought to facilitate user simulation models’ construction and, furthermore, enable rapid individual-level simulation by developing and introducing a more efficient method for inferring parameters from real user behavior.
2.1 Inverse Modeling with User Simulations
The inverse modeling process ascertains the best model parameters for a given dataset. After inversion, one can approach the model’s parameters as hypotheses about the latent characteristics of interest. Where the model represents the user’s behavior via a simple closed-form expression, researchers can, from the parameters, calculate the likelihood of any given observation or a particular discrepancy between a prediction and what actually happened. Fitts’ law [
6,
7,
17,
42,
63], the diffusion model for binary decisions [
64,
65], the classic model for visual perception of speed [
71], and the integrated eye-movement model [
67] are some of the many traditional user models with parameters determined through MLE or least-squares estimation. This approach does not permit the simulation models’ inversion, however: since most user simulation models involve complex relationships without a known likelihood function, likelihood-based fitting methods (MLE etc.) cannot invert them. Hence, many previous studies adopt values from the literature or tune parameters manually.
ABC has emerged as a popular likelihood-free approach for inverting simulation models [
4,
11,
27]. It offers a systematic way of finding the parameters that replicate a given dataset well in the model’s parameter space and thus revealing the parameters’ posterior distribution from the observed data. Kangasrääsiö
et al. [
34] first attempted to apply ABC to fit an RL-based simulation of menu selection. However, its application to user simulation models is plagued by a time-efficiency problem. Running the many simulations required, with various parameter candidates, is time-consuming when the simulation model must optimize its control policy anew for each candidate. Moon
et al. [
49] introduced a way of addressing this. They showed that the control policy of user simulation models can be generalized across varying model parameters by means of multi-task RL. Consequently, the ABC process, which had previously required days of computation [
34], may be reduced to one or two hours. Nonetheless, as is hardly surprising, individual-level fitting based on ABC is normally restricted to smaller numbers of users (e.g., 5 [
34] or 20 [
49]). Though one recent study [
18] of task-interleaving behavior employed ABC to fit an RL-based model to the data of 211 individuals, fitting this model to an individual user took 1.5 hours of computation time. Our motivation in delving into amortized inference was to see whether it could overcome these issues, thereby opening the door to larger-scale and real-time applications of inverse modeling in HCI.
2.2 Amortized Inference
Amortized inference consists of “investing upfront computation to support rapid online inference” [
72]. With modern ML methods, a complex probability distribution can be estimated through variational inference [
5,
38,
77], which uses optimization of tractable and parameterized distributions to approximate the target distribution. One can amortize this variational inference by building a neural proxy model, a
conditional density estimator that learns an effective mapping from a given observation (e.g., an individual’s behavior) to an approximate distribution (e.g., a posterior of model parameters) [
38]. Once trained through upfront computation, the conditional density estimator enables rapid inference of the posterior of the parameters conditioned on the given observation.
Researchers have examined many ML methods in efforts to solve this conditional density estimation problem [
11,
19,
45,
55]. One critical factor is the neural networks’ representation capacity – they must be able to learn the complex probability distribution. Early methods combined Gaussians, to estimate the posterior distribution, with trained neural-network models (e.g., mixture density networks), to predict the means and covariance of the Gaussians [
57,
62]. Though these methods performed sufficiently well for several inverse problems, they show limited ability to represent more complex posterior distributions (multimodal ones especially). As an emerging approach,
normalizing flows [
58,
66] approximate complex distributions through sequential bijective transformation steps, starting with a simple normal distribution. Here, invertible neural networks (INNs) model the bi-directional conversion between the target distribution and a tractable one. This invertibility constitutes the unique advantage of normalizing flows over other variational inference models (e.g., a variational autoencoder [
38]). A tractable distribution that can be converted to the target distribution in a bijective manner renders it possible to calculate that target’s posterior density precisely and provides for easy sampling of the parameters from it. Both properties (tractable density and ease of sampling) are beneficial in the training and inference processes for density estimators. Therefore, recent research into amortized inference has welcomed normalizing flows [
2,
21,
23,
40,
58]. BayesFlow [
61] is a good example of recent work demonstrating the utility of normalizing flow models (RealNVP [
14]) for amortized inference. Non-HCI applications of amortized inference have exploited its flexibility; it is easy to apply to most simulation models yet does not necessitate critical assumptions [
13,
22,
60]. The HCI field, however, has been devoid of attempts to amortize inference for simulation models, with the exception of Murray-Smith
et al.’s successful inference of a finger’s 3D pose from sensor data via a conditional variational autoencoder [
51]. We fill this gap with the first presentation of a general workflow for applying amortized inference to the HCI field’s simulation models.
3 Workflow for Amortized Inference
To present the workflow of applying amortized inference to the inverse problem of user simulation models, we formulate the inverse problem thus: Let us suppose there are three components. 1) A user simulation model outputs simulated user behavior when given the state of an interaction task as input, with the model’s parameters representing cognitive-physiological characteristics of a user; 2) prior distributions of the model parameters characterize prior knowledge of the plausible values of those user characteristics; and 3) observation data capture user behavior as collected from one or more users. To solve the inverse problem is to identify the posterior distribution of the model parameters from the given observation and arrive at the best estimated parameter values for the observed users.
At the heart of amortized inference is training a conditional density estimator such that it can estimate the posterior density quickly with a given observation as input. Accordingly, the workflow can be summarized in terms of the following steps.
(1)
Simulating user behaviors: Training of the conditional density estimator requires a large dataset composed of ground-truth pairs: parameters and corresponding user behaviors. With a user simulation model, we generate sufficient numbers of user behaviors with various parameter combinations sampled from the prior distributions.
(2)
Building and training the density estimator: The density estimator is built with neural-network structures able to represent the parameters’ complex posterior distribution from the given high-dimensional observation data. We train the density estimator to approximate the true posterior by using the simulated dataset from step 1.
(3)
Inferring via the trained density estimator: Finally, we deploy the trained density estimator to infer the posterior of model parameters from actual user behaviors. Its inference process consumes very little computation time (e.g., tens of milliseconds).
While the first two steps are computationally expensive (requiring 1–2 days), these one-time actions enable rapid inference in step 3 for every new user’s data.
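As an end-to-end illustration of these three steps, consider the following toy sketch on a scalar simulator. All names and values here are illustrative only, and a sorted nearest-neighbour table stands in for the trained neural density estimator; it conveys only the amortization structure (expensive upfront simulation and training, then near-instant inference per observation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: simulate user behaviors with a toy simulator y = f(x; theta),
# where theta is a scalar "user characteristic" and y a summary statistic.
def simulate(theta, n_trials=64):
    return (2.0 * theta + rng.normal(0.0, 0.1, n_trials)).mean()

theta_train = rng.uniform(0.0, 1.0, 5000)              # theta ~ p(theta)
y_train = np.array([simulate(t) for t in theta_train])

# Step 2: one-time upfront cost. A sorted nearest-neighbour table stands
# in for training the neural density estimator on the simulated dataset.
order = np.argsort(y_train)
y_sorted, theta_sorted = y_train[order], theta_train[order]

# Step 3: rapid inference for any new observation y_o.
def infer(y_o, k=50):
    i = np.searchsorted(y_sorted, y_o)
    lo, hi = max(0, i - k), min(len(y_sorted), i + k)
    return theta_sorted[lo:hi]        # approximate posterior sample

posterior = infer(simulate(0.5))      # behavior of a fresh "observed" user
```

Once the table (or, in the real workflow, the density estimator) is built, each call to `infer` costs only a lookup, which is the essence of amortization.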
3.1 Notation
We express the operation of a user simulation model as y = f(x;θ), where y represents the model’s output (i.e., user behavior), x represents the input (i.e., task state), and θ represents the parameters (i.e., a user’s cognitive-physiological characteristics). The prior of model parameters is denoted by p(θ), and their posterior when behavior data are given is p(θ|y). We use the hat symbol (ˆ) to highlight variables for data estimated or predicted through models (e.g., \(\boldsymbol {\hat{\theta }}\) for the estimated model parameters and \(\boldsymbol {\hat{y}}\) for the predicted user behavior). Finally, the subscript “o” refers to data observed from real users, as in yo for the observed user-behavior data.
3.2 Simulation of User Behavior
With user simulation models, we can generate a training dataset for the density estimator, consisting of plausible sets of
θ and simulated
y under them, in the form of (
θ,
y) pairs. Recent simulation models reproduce user behavior (
y) through the sequential decision-making of an RL agent: the agent obtains partial (or noisy) information about the task environment’s state in keeping with an individual’s cognitive-physiological bounds (dictated by mechanisms’ limits [
47,
67,
71] and variable parameters,
θ) and chooses an action that achieves maximum utility. In RL terms, the simulated user’s internal decision-making process is formulated as a partially observable Markov decision process (POMDP). The agent’s decision-making function (i.e., control policy) can be implemented as a neural-network model that, from the observed state, ascertains the action, and deep RL techniques such as the deep Q-network (DQN) algorithm [
15,
30,
48] afford optimizing it. One problem that most current simulation models face here is that their control policy is usually optimized within fixed bounds (i.e., with fixed
θ) [
10,
15,
30]. To collect the training data under various
θ values, one must adjust the control policy for each
θ. Rather than take the time-consuming path of optimizing a control policy anew for each
θ, this workflow follows recent multi-task RL approaches; i.e., a single generalized control policy is trained for a family of tasks (solving for a POMDP family parameterized by
θ). Here, a neural-network-based policy model is implemented that receives
θ as external input alongside the observed task state. Modulated Q-network [
49] and optimal control ensembles [
41] exemplify such multi-task RL approaches’ recent use. By optimizing the policy model subject to the multi-task environment (i.e., episodes with various
θ), we are able to achieve a generalized optimal policy. In consequence, the training dataset can be simulated, where
θ ∼
p(
θ),
y =
f(
x;
θ).
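A minimal sketch of this data-generation step is given below. A stub function stands in for the θ-conditioned neural control policy, and the task dynamics are toy ones; all names are hypothetical. The key point is that the policy receives θ as input, so one policy serves every sampled parameter set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical generalized policy: it receives theta alongside the observed
# task state, so a single policy covers the whole POMDP family.
def policy(state, theta):
    return np.clip(state - theta[0], -1.0, 1.0)   # stub for a neural policy

def simulate_episode(theta, steps=20):
    state, total_time = 1.0, 0.0
    for _ in range(steps):
        action = policy(state, theta)
        state = abs(state - 0.5 * action)         # toy task dynamics
        total_time += theta[1]                    # e.g., per-step duration cost
    return np.array([state, total_time])          # behavior summary y

# theta ~ p(theta): here (skill offset, step duration) from simple priors
thetas = np.column_stack([rng.uniform(0.0, 1.0, 1000),
                          rng.uniform(0.1, 0.5, 1000)])
dataset = [(theta, simulate_episode(theta)) for theta in thetas]
```

The resulting (θ, y) pairs form the training dataset for the density estimator.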
3.3 Building and Training a Density Estimator
The conditional density estimator’s objective is rapid estimation of the posterior (
p(
θ|
y)) from given behavior data (
y). For the general case, it has to meet two crucial requirements. Firstly, it must deal with various data types present in user-behavior observations (e.g., a simple summary of multiple task trials or full details of each) and extract meaningful features from them as a fixed-size vector. Secondly, when given the extracted features, it should be capable of representing even highly complex distributions of
p(
θ|
y), including multiple modes (peaks). To address both requirements, we designed the density estimator to consist of an
encoder network and a
conditional INN (see Figure
2). The estimator learns bi-directional conversion between a latent variable (denoted as
z) from a normal distribution
\(\mathcal {N}(\boldsymbol {0}, \boldsymbol {{I}})\) and
θ from a complex distribution
p(
θ|
y), when given
y. Therefore,
θ =
gϕ(
z;
y) and
\(\boldsymbol {z}=g^{-1}_{\boldsymbol {\phi }}(\boldsymbol {\theta };\boldsymbol {y})\) , where
gϕ represents the forward conversion through the density estimator with neural-network weights
ϕ. Because the latent variable (
z) follows a normal distribution, it is easy to sample
z and calculate its exact density. Therefore, with the bijective transformation between
z and
θ, we can achieve easy sampling and density estimation for
p(
θ|
y) as well.
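The bijective conversion can be illustrated with a single conditional affine map, a drastically simplified stand-in for the full conditional INN (the conditioning functions here are arbitrary stubs, not learned networks). The round-trip z → θ → z is exact by construction, which is what makes both density evaluation and sampling cheap.

```python
import numpy as np

rng = np.random.default_rng(2)

# One conditional affine "flow step": theta = z * exp(s(y)) + t(y),
# invertible for any conditioning vector y.
def scale_shift(y_feat):
    # stand-ins for small conditioning networks s(y) and t(y)
    return np.tanh(y_feat.sum()), 0.5 * y_feat.mean()

def g_forward(z, y_feat):            # z -> theta
    s, t = scale_shift(y_feat)
    return z * np.exp(s) + t

def g_inverse(theta, y_feat):        # theta -> z
    s, t = scale_shift(y_feat)
    return (theta - t) * np.exp(-s)

y_feat = rng.normal(size=8)          # feature vector from the encoder
z = rng.normal(size=4)               # z ~ N(0, I)
theta = g_forward(z, y_feat)
z_back = g_inverse(theta, y_feat)    # exact round-trip by construction
```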
3.3.1 The encoder network.
An encoder network (Figure
2(b)) outputs a fixed-size feature vector (denoted by
\(\boldsymbol {\tilde{y}}\) ) from a given observation (
y). The design of this network can vary, depending on the type of
y. We consider the most basic
y first: simple summary statistics for a user’s multiple task trials (e.g., average completion time or task scores). In this case, the observation
y from each user has the same size, so the encoder network can be constructed with a multi-layer perceptron, or MLP (that is, multiple linear layers coupled with nonlinear activation). When, on the other hand,
y is the full set of observation data from multiple
i.i.d. task trials (e.g., all the completion times and task scores for each trial), the size for observation set (
y) can be varied by user on the basis of the number of trials completed. To extract a fixed-size output from this variable-size input in a permutation-invariant manner (i.e., without restrictions imposed by trial order), we utilize recently developed query–key–value (QKV) attention structures [
29,
76]. Supplement A.1 provides further details.
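A minimal numpy sketch of such a permutation-invariant encoder follows; random matrices stand in for the learned trial embedding and query vector, and the pooling is a simplification of full QKV attention. Shuffling the trials leaves the output feature vector unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimal permutation-invariant encoder for a variable number of i.i.d.
# trials: embed each trial, then pool with softmax attention over trials.
W = rng.normal(size=(3, 16))         # stand-in for a learned trial embedding
q = rng.normal(size=16)              # stand-in for a learned query vector

def encode(trials):                  # trials: (n_trials, 3) -> (16,)
    h = np.tanh(trials @ W)                  # per-trial features
    a = np.exp(h @ q)
    a /= a.sum()                             # attention weights over trials
    return a @ h                             # fixed-size feature vector

trials = rng.normal(size=(30, 3))
f1 = encode(trials)
f2 = encode(trials[rng.permutation(30)])     # same set, shuffled trial order
```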
3.3.2 The conditional invertible neural network.
The conditional INN (Figure
2(c)) targets approximating
p(
θ|
y) in accordance with the feature vector of the given behavior data (
\(\boldsymbol {\tilde{y}}\) ). The INN models bijective transformation between
p(
θ|
y) and
p(
z) (=
\(\mathcal {N}(\boldsymbol {0}, \boldsymbol {{I}})\) ) when given
\(\boldsymbol {\tilde{y}}\) . To this end, it is designed with several
flow steps [
58,
66], each of which is an invertible transformation between inputs and outputs, especially conditioned on
\(\boldsymbol {\tilde{y}}\) . As flow steps stack more deeply, it is possible to express more complex distributions, starting from a normal distribution (observe how
p(
z) transforms to
p(
z1) → ⋅⋅⋅ →
p(
zn − 1) →
p(
zn) =
p(
θ|
y) over
n flow steps in Figure
2(c)). We implemented each flow step with a recent normalizing flow model,
Glow [
37] (see Supplement A.2 for details).
3.3.3 The training process.
The encoder network and the conditional INN are trained once, from the data generated by the user simulation models
(see Figure
3(a)). The training’s objective is to minimize the Kullback–Leibler divergence (KLD) between true posterior
p(
θ|
y) and our estimated posterior
\(\hat{p}_{\boldsymbol {\phi }}(\boldsymbol {\theta }|\boldsymbol {y})\) ; hence, our goal is to find
ϕ*, the optimal neural-network weights for the density estimator that approximates the true posterior as closely as possible. That is,
\(\boldsymbol {\phi }^* = \underset{\boldsymbol {\phi }}{\arg \min }\; \mathbb {E}_{p(\boldsymbol {y})}\left[ \text{KL}\left(p(\boldsymbol {\theta }|\boldsymbol {y}) \,\Vert \, \hat{p}_{\boldsymbol {\phi }}(\boldsymbol {\theta }|\boldsymbol {y})\right) \right] .\)
Here, the objective function can be solved via stochastic gradient descent methods with the loss function
\(\mathcal {L}(\boldsymbol {\phi })\) below (the literature provides details of the derivation and its validated foundation [
61]). If we have a batch of
M pairs of true
θ and
y from the simulated training dataset (per Section
3.2), then
\(\mathcal {L}(\boldsymbol {\phi }) = \frac{1}{M} \sum _{i=1}^{M} \left( \frac{1}{2} \left\Vert g^{-1}_{\boldsymbol {\phi }}(\boldsymbol {\theta }^{(i)}; \boldsymbol {y}^{(i)}) \right\Vert ^2 - \log \left| \text{det}\left(\boldsymbol {J}^{(i)}_{g_{\boldsymbol {\phi }}} \right) \right| \right), \quad (1)\)
where
\(\text{det}\left(\boldsymbol {J}^{(i)}_{g_{\boldsymbol {\phi }}} \right)\) denotes the determinant of the Jacobian matrix
\(\partial g^{-1}_{\boldsymbol {\phi }}(\boldsymbol {\theta }; \boldsymbol {y})/\partial \boldsymbol {\theta }\) at (
θ(i),
y(i)). The function’s first term refers to a negative log of
\(\mathcal {N}(\boldsymbol {z} | \boldsymbol {0}, \boldsymbol {{I}})\) ; therefore, minimizing
\(\mathcal {L}(\boldsymbol {\phi })\) causes
z to follow
\(\mathcal {N}(\boldsymbol {0}, \boldsymbol {{I}})\) .
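For illustration, this loss can be computed in closed form for a single elementwise affine "flow", where the log-determinant of the inverse transform is simply the negative sum of the per-dimension log-scales. This is a toy stand-in for the full conditional INN, not the estimator used in the study.

```python
import numpy as np

rng = np.random.default_rng(4)

# Inverse transform of a toy elementwise affine flow conditioned on y:
# z = (theta - t(y)) * exp(-s(y)), with log|det dz/dtheta| = -sum(s).
def inverse_and_logdet(theta, y_feat):
    d = theta.shape[1]
    s = np.tanh(y_feat[:, :d])             # per-dimension log-scales from y
    t = y_feat[:, d:2 * d]                 # per-dimension shifts from y
    z = (theta - t) * np.exp(-s)
    log_det = -s.sum(axis=1)               # log|det Jacobian| of the inverse
    return z, log_det

# Batch loss: 0.5 * ||z||^2 minus the log-determinant term, averaged over M.
def flow_loss(theta, y_feat):
    z, log_det = inverse_and_logdet(theta, y_feat)
    return np.mean(0.5 * np.sum(z ** 2, axis=1) - log_det)

theta = rng.normal(size=(512, 4))          # a batch of M = 512 pairs
y_feat = rng.normal(size=(512, 8))
loss = flow_loss(theta, y_feat)
```

Minimizing the first term pushes the mapped latent variables toward the unit normal, while the log-determinant term prevents the transform from collapsing.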
3.4 Inference with a Trained Density Estimator
With the trained density estimator, we can rapidly infer the posterior of model parameters (
θ) from real users’ observed behavior (
yo), as illustrated in Figure
3(b). Sampling the latent variable
z from
\(\mathcal {N}(\boldsymbol {0}, \boldsymbol {{I}})\) and computing
θ via
gϕ(
z;
yo) is equivalent to sampling
θ from its true posterior
p(
θ|
yo) if one assumes perfect convergence of density-estimator training [
61]. Therefore, obtaining a plausible set of
θ from the given
yo comes at very low computational cost, only that of passing the sampled
z and
yo through the trained density-estimator network, which can be performed on the order of milliseconds. With the set of plausible
θ candidates, we can approximate the full posterior distribution
p(
θ|
yo) by means of well-known methods, such as kernel density estimation (KDE), and determine the estimated values of model parameters
\(\boldsymbol {\hat{\theta }}\) , typically using maximum
a posteriori (MAP) estimation [
34].
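A sketch of this inference step is shown below, with a fixed affine map standing in for the trained estimator gϕ(z; yo) and SciPy's `gaussian_kde` approximating the posterior; the observation vector and mapping are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)

# A fixed affine map stands in for the trained estimator g_phi(z; y_o).
def g_forward(z, y_o):
    return z * 0.1 + y_o.mean()            # toy push-forward of N(0, I)

y_o = np.array([0.4, 0.6])                 # observed behavior summary
z = rng.normal(size=1000)                  # sample z ~ N(0, I)
samples = g_forward(z, y_o)                # 1,000 theta candidates

kde = gaussian_kde(samples)                # approximate p(theta | y_o)
theta_map = samples[np.argmax(kde(samples))]   # MAP over the sampled set
```

The cost per user is only the forward pass plus the KDE evaluation, which is why inference runs in milliseconds once training is done.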
3.5 Use of a Point Estimator
Should distribution information be unnecessary, one can simplify the estimator model’s structure by seeking only point-estimate values of the parameters. This can yield the benefit of decreased size of the neural-network model, thereby reducing the computation costs for training and inference. The workflow presented above supports training a
point estimator accordingly. Instead of a conditional INN in a density estimator, one can build a point estimator with any feedforward network (e.g., an MLP) that directly outputs a point estimate from a feature vector extracted from the encoder network. With the same training dataset, the process would use a general regression loss (e.g., mean squared error) rather than the training loss of Equation
1. The inference process would consist only of feeding forward the given observation data, without any need for sampling of the latent variable
z. Further on (in Section
8.3), we report the quantitative results from testing how using point estimators as opposed to density estimators affects inference performance.
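A toy sketch of the point-estimator alternative follows, with a closed-form least-squares fit standing in for an MLP trained under mean-squared-error loss; the simulator and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated training pairs: behavior y generated from parameter theta.
theta = rng.uniform(0.0, 1.0, (2000, 1))
y = 2.0 * theta + rng.normal(0.0, 0.05, (2000, 1))

# Point estimator: direct regression from y to theta under MSE loss.
# Closed-form least squares stands in for an MLP trained by gradient descent.
X = np.hstack([y, np.ones_like(y)])           # feature plus bias term
w, *_ = np.linalg.lstsq(X, theta, rcond=None)

theta_hat = (np.array([1.0, 1.0]) @ w).item()  # point estimate for y_o = 1.0
```

Inference is a single feedforward pass, but the posterior shape (spread, multimodality) is lost, which is the trade-off examined in Section 8.3.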
5 Case 1: Menu Search
The menu-search model [
10] simulates a user’s behavior in searching for a target item in a linear menu interface. The goal of the simulated user is to either select the given target item (if it is present in the menu) or declare that the target item does not exist (if it is not). The simulated user can discover each item’s information by looking directly at it, by looking at an item above or below it (with peripheral vision), or through memory recall. Via sequential decision-making directed by an optimized control policy, the model replicates a user’s eye movements during menu search and, thereby, predicts the completion time of each trial (see prior publications for further details [
10,
34]). For our study, we decided to infer four of the menu-search model parameters, following the lines of earlier work [
34] that used ABC. These inferred parameters
θ are
•
dfix: the duration for eye fixation
•
dsel: the duration for item selection
•
prec: probability of recalling all items’ information during the first fixation period
•
psem: probability of investigating the item above or below the fixation target via peripheral vision
For the inverse modeling, we used a dataset of menu-search behavior [
3] from 21 users (13 of them female; mean age 21.1,
σ=3.54). It comprised the task-completion time and eye-movement data for each menu-search trial. From the full dataset, our study considered only material from menu layouts that the simulation model covers (i.e., with eight items grouped by semantic similarity); hence, we employed a dataset of 18 participants’ behavior (150 trials per user, on average). We used the same observation data
y as in [
34] for model fitting: a 1-D vector containing the task-completion times’ mean and standard deviation values for each case, with the target present or absent. This represents one of the simplest and most common settings of inverse modeling problems in HCI, with user behavior visible only from aggregate data.
5.1 Implementation Details
With the workflow described in Section
3, we applied amortized inference as presented below.
5.1.1 Model simulation.
We generated a dataset of simulated behavior
y under the various values of
θ. While the original model used a control policy optimized for one fixed
θ, we generalized the policy to ensure optimal decision-making with each value of
θ, implementing the control policy via a modulated Q-network [
49], which can behave optimally for the various sets of
θ after training. Supplement B.1 elaborates on both the network and its training. During the model simulation,
θ is sampled from given prior distributions
p(
θ) for each trial. We chose the same
p(
θ) as the earlier work [
34], thus: the prior for
dfix is a truncated normal distribution where (mean, std, min, max) = (300 ms, 100 ms, 0 ms, 600 ms); the prior for
dsel is a truncated normal where (mean, std, min, max) = (300 ms, 300 ms, 0 ms, 1000 ms); that for
prec is uniform where (min, max) = (0.0, 1.0); and
psem’s prior is a uniform one where (min, max) = (0.0, 1.0). We sampled 262K distinct
θ values from
p(
θ) and simulated 256 trials for each sampled
θ. In total, the training dataset consists of 67M simulated trials. The entire simulation took 1.5 hours (in wall-clock time) with 16 CPU cores (AMD Ryzen 9 5950X, 3.5 GHz).
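Sampling from these priors can be reproduced with SciPy's `truncnorm`, which takes its truncation bounds in standardized units ((lo − mean)/std, (hi − mean)/std). The sketch below mirrors the prior values above; it is not the study's actual sampling code.

```python
import numpy as np
from scipy.stats import truncnorm, uniform

rng = np.random.default_rng(7)

def trunc_normal(mean, std, lo, hi, size):
    # scipy's truncnorm expects bounds standardized by loc/scale
    a, b = (lo - mean) / std, (hi - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size,
                         random_state=rng)

n = 262_000                                  # number of distinct theta values
theta = np.column_stack([
    trunc_normal(300, 100, 0, 600, n),       # d_fix (ms)
    trunc_normal(300, 300, 0, 1000, n),      # d_sel (ms)
    uniform.rvs(0, 1, size=n, random_state=rng),   # p_rec
    uniform.rvs(0, 1, size=n, random_state=rng),   # p_sem
])
```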
5.1.2 Density-estimator training.
Then, we built and trained the conditional density estimator by using the simulation dataset formed. In this menu-search scenario,
y is a simple one-dimensional fixed-size vector, so we opted for a simple MLP as the encoder network to extract
\(\boldsymbol {\tilde{y}}\) from given
y. We implemented the conditional INN to consist of five Glow steps, for all our cases (1–3). A batch of (
θ,
y), with size 512, was sampled from the dataset and used to compute the loss (Eq.
1) for gradient descent updating. We performed 600K training steps (i.e., gradient descent updates), with a total training time of approximately 10 hours on a standard workstation with the aforementioned CPU and an NVIDIA GeForce RTX 3080 GPU. Supplement B.2 gives further details of the density-estimator model implemented and its training.
5.1.3 Inference.
By sampling 1,000 latent variables
z from a unit normal distribution and taking
yo as input to the density estimator, we sampled 1,000
θ candidates. We estimated
p(
θ|
yo) by applying KDE with the sampled candidates for visualization, and we determined our
\(\boldsymbol {\hat{\theta }}\) as the MAP value of
p(
θ|
yo). The processes for cases 1–3 all applied the same inference procedure.
5.2 Evaluations
We evaluated the performance of amortized inference from the following two perspectives: how well the conditional density estimator can recover ground-truth model parameters from the simulated data (parameter recovery) and how well the parameter values inferred from the empirical user data can actually predict the user’s behavior (behavior prediction).
To judge the former, we randomly sampled 100 sets of θ from p(θ) and simulated y from them. Then, we measured the coefficient of determination (\(R^2\)) between the ground-truth θ and the density estimator’s prediction ( \(\boldsymbol {\hat{\theta }}\) ). Here, \(R^2\) represents the proportion of the variation in the predicted \(\boldsymbol {\hat{\theta }}\) that can be explained by variation in the true θ. Hence, higher values are better; the range is 0 to 1, so \(R^2\) = 1 denotes a perfect prediction for θ.
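For concreteness, the recovery metric can be computed as in the minimal helper below (an illustrative implementation with our own names, not code from the study):

```python
import numpy as np

def r_squared(theta_true, theta_pred):
    """Coefficient of determination between ground-truth and predicted
    parameter values; 1.0 indicates perfect parameter recovery."""
    theta_true = np.asarray(theta_true, dtype=float)
    theta_pred = np.asarray(theta_pred, dtype=float)
    ss_res = np.sum((theta_true - theta_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((theta_true - theta_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

true = np.array([0.1, 0.4, 0.7, 0.9])
assert r_squared(true, true) == 1.0                    # perfect recovery
assert r_squared(true, true + 0.1) < 1.0               # noisy recovery scores lower
```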
Next, we assessed the model’s behavior-prediction performance with real users’ data when using the parameters fitted via amortized inference. We investigated two scenarios for applying amortized inference: group-level and individual-level fitting. At the group level, one model parameter set was estimated from the all-users dataset. In individual-level fitting, on the other hand, a model parameter set was estimated for each individual user’s behavior data. We measured the prediction accuracy as the distance between the simulated behavior under the estimated parameters ( \(\boldsymbol {\hat{y}^\prime }\) ) and the actual behavior of the same users ( \(\boldsymbol {y_o^{\prime }}\) ).
We considered three distinct distance metrics: 1) mean difference, 2) Kullback–Leibler divergence (KLD), and 3) maximum mean discrepancy (MMD) [24]. Mean difference captures the absolute difference between the mean values of the behavior features across the trials in each case (ground-truth observation vs. model prediction). In contrast, KLD and MMD estimate the difference between distributions. The closer KLD and MMD are to 0, the closer the behavior-feature distributions of \(\boldsymbol {\hat{y}^\prime }\) and \(\boldsymbol {y_o^{\prime }}\) are.
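As a sketch of the third metric, a (biased) squared-MMD estimate with a Gaussian kernel can be computed as follows; the kernel bandwidth here is an illustrative fixed choice rather than the study’s actual setting, which follows [24]:

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared maximum mean discrepancy between two 1-D samples under a
    Gaussian (RBF) kernel; values near 0 indicate matching distributions."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    def k(a, b):                                # pairwise kernel matrix
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.0, 1.0, 500)    # same distribution as a
c = rng.normal(2.0, 1.0, 500)    # shifted distribution
# Matching distributions yield MMD^2 near 0; a shift yields a clearly larger value.
assert rbf_mmd2(a, b) < rbf_mmd2(a, c)
```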
We measured the distance with regard to four features of menu-search behavior: each trial’s completion time and number of fixations, both when the target is present and when it is absent (2 × 2). We compared the resulting fits (group- and individual-level) from amortized inference with the baseline: use of the earlier model parameters fitted via ABC [34]. This baseline represents applying conventional non-amortized inference to fit the parameters at the group level.
5.3 Results
The inference process, run a single time for each dataset, took about 10 ms with our trained density estimator (for both group- and individual-level fitting). Accordingly, individual-level inference for 18 users consumed only about 200 ms in total. This is a significant improvement over the previous (ABC) method, which took 37 h for group-level fitting with the same user dataset [34].
Figure 5(a) depicts parameter-recovery performance with our trained density estimator, showing the correlation between the ground-truth values and our predicted values for each model parameter. The results show that the trained estimator predicts most parameters with a high coefficient of determination (\(R^2\) = 0.74 for \(d_{fix}\), 0.87 for \(d_{sel}\), and 0.82 for \(p_{rec}\)). It was more difficult to estimate \(p_{sem}\) from the given behaviors (\(R^2\) = 0.33).
Figure 5(b) shows the estimated posterior distribution from the actual behavior measured from all users. Our estimated posterior distribution closely matches the baseline model parameters, which lie within its high-probability region. Table 1 shows the distance between the observed behavior and the three model predictions. While the baseline, group-level, and individual-level model fits with amortized inference all replicate the observations reasonably well (see Figure 6 for the full histograms of observed and simulated behaviors over trials), it was our individual-level fitting that made the most accurate prediction, with the smallest distance on nine of the 12 metrics (4 behavior features × 3 distance metrics). On the other hand, there was no noticeable difference between the baseline and our group-level fit, which showed the smallest distance on one and two metrics, respectively; that is, the amortized approach offered inference performance comparable to ABC’s (in the same group-fitting setting) while vastly reducing the computation cost (from 37 h to 10 ms).
6 Case 2: Point-and-click
Case 2’s point-and-click model [15] simulates the user behavior of selecting a distant target on a computer screen by means of a mouse device. In each trial, one circular target with a random size appears at a random location onscreen and moves at a random but constant speed in a random direction. The goal of the simulated user is to click the mouse button with the cursor positioned within the target. The agent visually perceives the target’s and the cursor’s state at every timestep. Proceeding from the perceived information, the simulated user iteratively updates its motor plan and decides, per its control policy, whether to click during the motor plan. The original publication presents the simulation model more fully. In this study, we inferred four parameters of the model:
• \(\sigma_v\): visual perception noise
• \(n_v\): signal-dependent motor noise
• \(c_\sigma\): precision of one’s internal clock
• \(T_{h,max}\): maximum length of the motor plan a user can build (i.e., the prediction horizon)
In a previous study [49] using ABC, the authors attempted to infer the first three parameters, each of which represents a user capability – for visual perception, motor movement, and click action, respectively. The final parameter (\(T_{h,max}\)) additionally represents a user’s motor-planning capability.
For the inverse modeling, we used their dataset of point-and-click behavior from 20 users (13 of them female; mean age 30.4, σ = 8.65) [49]. The following data were collected for each trial: the user’s task performance (success/failure and completion time), the trial’s initial state (the radius, position, and velocity of the target and the cursor’s position and velocity), and the trajectories of the target and cursor (positions every 50 ms). We used y encompassing all of the aforementioned data from 600 trials. This represents a more complex class of inverse modeling problem in HCI than Case 1, in that it involves complex user behaviors, of multiple types (with different semantics and data types), in each of the i.i.d. trials. In addition, the empirical data include values, measured participant-specifically through separate experiments, for the three cognitive-physiological capabilities (\(\sigma_v\), \(n_v\), and \(c_\sigma\)). The previous study showed that inferred parameter values for the point-and-click model could describe the measured abilities.
6.1 Implementation Details
Most aspects of the implementation for amortized inference were the same as in Case 1. We implemented the control policy of the point-and-click model with a modulated Q-network, using the prior work’s network structure and optimization method [49]. This generalization of the control policy enables single-model simulation of point-and-click behavior y under the various values of θ. Supplement C.1 details the control-policy structure and optimization.
The simulation’s prior distributions p(θ) too were determined from the earlier work [49]: the prior for \(\sigma_v\) is a log-uniform distribution where (min, max) = (0.069, 0.415); that for \(n_v\) is log-uniform with (min, max) = (0.145, 0.413); and that for \(c_\sigma\) is log-uniform where (min, max) = (0.055, 0.400). The previous report did not cite any prior knowledge of the range for \(T_{h,max}\), so we used a simple uniform prior for it where (min, max) = (0.5 s, 2.5 s). We sampled 262K distinct θ values from p(θ) and simulated 32 trials for each sampled θ. In total, 8M simulated point-and-click trials were generated for density-estimator training. With 16 CPU cores, the simulation process took seven hours.
In contrast to Case 1’s y, which consists only of summary features across trials, Case 2’s consists of multiple trials’ data. The number of observed trials can vary across users (i.e., the size of y can vary). Accordingly, we implemented an attention-based encoder network to extract fixed-size features ( \(\boldsymbol {\tilde{y}}\) ) from variable-size observations (y). One million gradient descent steps (with batch size 32) were performed with the simulated training dataset. The total training time, approximately 30 h, was longer than Case 1’s because of the more complex observation data and network structure involved. Supplement C.2 elaborates on the density-estimator model implementation and training.
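The core idea of such an encoder – mapping any number of per-trial feature vectors to one fixed-size summary via attention weights – can be illustrated with the minimal numpy sketch below. The weights are random stand-ins and the function is a single attention-pooling step, not the trained network detailed in Supplement C.2:

```python
import numpy as np

def attention_pool(y, w_query, W_val):
    """Pool a variable number of per-trial feature vectors (rows of y)
    into one fixed-size summary using softmax attention weights."""
    scores = y @ w_query                     # one relevance score per trial
    a = np.exp(scores - scores.max())
    a = a / a.sum()                          # softmax over trials
    return a @ (y @ W_val)                   # weighted sum -> fixed-size vector

rng = np.random.default_rng(0)
d_in, d_out = 5, 8                           # illustrative feature sizes
w_query = rng.normal(size=d_in)
W_val = rng.normal(size=(d_in, d_out))

# The summary size is the same whether a user contributed 10 or 600 trials.
short = attention_pool(rng.normal(size=(10, d_in)), w_query, W_val)
long = attention_pool(rng.normal(size=(600, d_in)), w_query, W_val)
assert short.shape == long.shape == (d_out,)
```

Because the pooling is a weighted sum over trials, the output is also invariant to the order of the i.i.d. trials, which is the property the inverse-modeling setup requires.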
6.2 Evaluations
We evaluated the performance of our amortized inference from the same angles as in Case 1: parameter recovery and behavior prediction (detailed in Section 5.2). With regard to the first, we assessed how well the trained density estimator could infer 100 sets of ground-truth parameter values from the simulated behaviors. For Case 2, the investigation additionally considered the estimator’s ability to recover real users’ measured cognitive-physiological capabilities from each user’s point-and-click behavior.
In this case too, we assessed behavior-prediction performance with regard to baseline, group-level, and individual-level fits. We measured the distance between the point-and-click behavior simulated via the fitted model parameters ( \(\boldsymbol {\hat{y}^\prime }\) ) and actual user behavior ( \(\boldsymbol {y_o^{\prime }}\) ) in terms of three features: 1) trial-completion time; 2) the cursor’s final click location relative to the target, with the target’s radius as the unit (hence, a value below 1 indicates that the click fell within the target); and 3) the cursor’s total travel distance in each trial. The baseline for comparison came from simulated behavior wherein the participant-specific cognitive-physiological capability values [49] furnished the model’s parameters; the baseline here thus represents obtaining each individual user’s unique characteristics to inform the simulation through laborious measurement processes.
6.3 Results
Since each dataset’s inference process took 125 ms with our trained density estimator, individual-level inference for 20 participants’ datasets used just 2.5 s. In contrast, the previous study reported 30 h for the same individual-level inference (using ABC alongside a policy-modulation technique for rapid model simulation).
Figure 7 presents the parameter-recovery performance with the density estimator, for both simulated behavior data and actual users’ behavior data. With the former, the density estimator predicts the model parameters with high (\(R^2\) = 0.94 for \(\sigma_v\), 0.67 for \(c_\sigma\), and 0.96 for \(T_{h,max}\)) and moderate (\(R^2\) = 0.38 for \(n_v\)) coefficients of determination. From the actual user behavior, it was possible to predict their cognitive-physiological capability values moderately well (\(R^2\) = 0.37 for \(n_v\) and 0.61 for \(c_\sigma\)) and at a low level (\(R^2\) = 0.23 for \(\sigma_v\)). This inference performance was comparable with that of the ABC-based prior technique [49], which estimated the same participants’ capability values \(\sigma_v\) and \(c_\sigma\) moderately well (with \(R^2\) values of 0.50 and 0.62, respectively) but failed to infer \(n_v\). The \(R^2\) value for the inferred \(\sigma_v\) is slightly lower in our amortized-inference case, but we succeeded in inferring \(n_v\) and achieved a huge speed improvement via amortized inference (to 2.5 s from 30 h).
Figure 8 plots the estimated posterior distributions of each model parameter from using our density estimator on the aggregated behavior data of all participants. The averaged values of the participants’ measured capabilities (\(\sigma_v\), \(n_v\), and \(c_\sigma\)) lie well within the high-probability region of the estimated posterior distributions. Table 2 shows the distance between the actual observation and the three model predictions: baseline, group-level, and individual-level fits (Figure 9 presents the full histogram for each simulated behavior). Of the three models, the individual-level fit (with amortized inference) showed the results closest to actual user behavior on the majority of metrics (5 of the 9). That is, the individual-level model fit outperformed the baseline method, in which the parameters were determined in line with each individual user’s measured capabilities.
7 Case 3: Touchscreen Typing
The touchscreen-typing model examined, by Jokinen et al. [30], simulates user behavior of typing on a cell phone or other touchscreen device with a finger. Here, the simulated user’s goal is to reach optimal typing performance by allocating two limited resources: finger movement and visual attention. Sharing these resources, the simulated user can either type a character in the target sentence or proofread the text produced so far (as dictated by the control policy’s decisions). If proofreading reveals an error, the model simulates the finger movement to correct the typed text (i.e., backspace over it). The finger-movement simulation balances speed against accuracy (faster finger movements are noisier) by applying a weighted homographic (WHo) model [25]. Thus, the model can simulate aggregate typing performance (e.g., typing speed and accuracy), error-correction processes during a trial, and finger and eye movements over time. Further details can be found in [30]. Our study inferred the following three parameters:
• \(p_{obs}\): probability of noticing an error directly from finger movement (without visual attention to the typed text)
• α: finger movements’ speed–accuracy bias (for the WHo model)
• k: overall motor resources for finger movement (for the WHo model)
In the original publication, the authors adopted values of α (0.6) and k (0.12) from prior literature [69], and the \(p_{obs}\) value was hand-tuned (to 0.7).
Our inverse modeling employed a dataset much larger than those in cases 1–2. It comprises 37,000 users’ touchscreen-typing behaviors with mobile devices [56]. We filtered this for data only from English-speaking users who typed with one finger on a QWERTY layout, and we also excluded trials wherein participants employed intelligent text-entry techniques. In consequence, the material for parameter fitting in Case 3 consisted of 14,277 trials’ data, from 1,057 users (665 of them female; mean age 28.10, σ = 10.67), with 13.5 trials per user on average. As in Case 2, y consisted of the behavior data from all the i.i.d. trials. From each typing trial, we used the following data: 1) words per minute (WPM), obtained by dividing the typed text’s word count by the trial-completion time; 2) error rate, calculated by dividing the Levenshtein distance [43] between the given and the typed text by the longer one’s length; 3) the number of backspace presses per trial; 4) keystrokes per character (KSPC), the number of press inputs divided by the number of characters typed; and 5) the length of the given text piece.
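Features 1), 2), and 4) can be computed as sketched below. This is an illustrative implementation under simple assumptions – e.g., WPM conventions vary, and here a “word” is simply a whitespace-separated token – rather than the study’s exact preprocessing code:

```python
def levenshtein(a, b):
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def typing_features(given, typed, seconds, n_presses):
    """WPM, error rate, and KSPC for one typing trial."""
    wpm = (len(typed.split()) / seconds) * 60.0
    err = levenshtein(given, typed) / max(len(given), len(typed))
    kspc = n_presses / max(len(typed), 1)
    return wpm, err, kspc

# A flawless trial: 3 words in 6 s, one key press per character.
wpm, err, kspc = typing_features("the quick fox", "the quick fox", 6.0, 13)
assert err == 0.0 and kspc == 1.0
assert abs(wpm - 30.0) < 1e-9
```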
7.1 Implementation Details
The control policy of the original touchscreen-typing model, composed of two distinct network models (actor and critic networks), was obtained via proximal policy optimization (PPO) [70]. We generalized this control policy to operate under a range of model parameters, in the same manner as for the modulated Q-networks used in Cases 1 and 2. We implemented and trained both networks to take given model parameters (θ) as inputs in addition to the observed task state. Such conditioning of actor-critic networks on given parameters has been verified [41] under the name “optimal control ensembles.” Supplement D.1 gives further details. The prior distributions of the model parameters were set such that the one for \(p_{obs}\) is a uniform distribution where (min, max) = (0, 1); the prior for α is a truncated normal where (mean, std, min, max) = (0.6, 0.3, 0.4, 0.9); and that for k is a truncated normal where (mean, std, min, max) = (0.12, 0.08, 0.04, 0.20). We sampled 66K distinct θ values from p(θ) and simulated 16 trials for each one sampled. Therefore, 1M simulated typing trials were obtained for density-estimator training. The simulation process took 14 hours on 16 CPU cores.
Since y consists of multiple trials’ data (as in Case 2), we implemented our attention-based encoder network to extract fixed-size features from the variable-size observations; for the training, we used 600K gradient descent steps (batch size = 100), and training took 15 h in all. Supplement D.2 provides details.
7.2 Evaluations
As with the other cases, we assessed parameter recovery (the accuracy of inferring 100 sets of ground-truth model parameters from the given simulated behavior) and behavior prediction (the accuracy of replicating real users’ behavior through simulation of group- and individual-level fitted models). For evaluating the latter, we measured the distance between the real users’ behavior and the model’s predictions, considering the following four behavior features: 1) WPM, 2) error rate, 3) backspace presses, and 4) KSPC. We used behavior simulated on the basis of the previously reported model parameters [30] as a baseline for comparison. In that the original study borrowed the parameter values from prior literature without fitting the model to empirical datasets, the baseline in Case 3 represents common practice in the HCI field. With Case 3, we showcase one of the most promising areas for applying high-computational-efficiency amortized inference: large-scale analysis of user data. With reference to the large-scale dataset [56], we can report on how the inferred cognitive parameters of individuals are distributed in a population (N = 1,057) in terms of two demographic factors: age and gender.
7.3 Results
The inference process took about 15 ms per given behavior dataset. Accordingly, the time for the individual-level inference of 1,057 users’ data with amortized inference was only 20 seconds in total.
Figure 10(a) depicts the parameter-recovery performance from the simulated behavior with our trained density estimator. Our trained estimator predicted all three model parameters with a high coefficient of determination (\(R^2\) = 0.82 for \(p_{obs}\), 0.71 for α, and 0.96 for k). Figure 10(b) plots the estimated posterior distributions from the aggregated behavior data of all 1,057 users. The baseline model parameters (adopted from the literature) fall within the estimated posterior distribution but not within its high-probability region. We interpret this as reflecting a limitation of using values imported from the literature: they are not always entirely representative of the dataset, and other confounding factors may be present. Our inferred parameters’ superiority to the literature-derived baseline for behavior prediction may support this interpretation. Table 3 attests that, by most metrics, our group-level fitted parameter values replicate real user behavior better than the baseline parameter values do. Furthermore, fitting the model at the individual level yielded even better prediction performance, with predictions closest to real user behavior. The full set of behavior histograms is reproduced in Figure 11.
Figure 12 shows how the parameter values inferred for each individual participant are distributed across the population in terms of both age and gender. We detected a significant correlation between users’ age and inferred k, which represents the motor resources behind a user’s finger movement (Pearson’s r = 0.18, p < 0.001). Therefore, typing-behavior data might reveal that younger people have more extensive finger-related motor resources (i.e., lower k). This result is consistent with previous work [69]. We found no significant age correlation for the other two parameters, \(p_{obs}\) (probability of noticing an error from finger movement) and α (the movements’ speed–accuracy bias), with p-values of 0.580 and 0.088, respectively. Regarding gender, on the other hand, we discovered a significant difference between the male and female user groups’ inferred k values (independent-samples t-test: t = 3.20, p < 0.001). This result revealed finger motor resources to be stronger in female than in male users, even though the female users were older (their mean age being 29.75 years versus 25.10 for the male users). The other two parameters did not differ significantly between the gender groups (p = 0.151 for \(p_{obs}\); 0.457 for α).
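The age analysis above reduces to a Pearson correlation over the inferred per-user values. A minimal sketch follows, with made-up numbers standing in for the 1,057 (age, inferred k) pairs – they are not the study’s data, only a shape-of-the-computation illustration:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Hypothetical (age, inferred k) pairs; in the study there are 1,057 of them.
age = np.array([20, 25, 30, 35, 40, 45], dtype=float)
k_hat = np.array([0.08, 0.10, 0.11, 0.13, 0.14, 0.16])

r = pearson_r(age, k_hat)
assert 0.9 < r <= 1.0   # this toy data is nearly linear in age
```

In practice one would also report a significance test (the study uses p-values and an independent-samples t-test for the gender comparison), e.g. via `scipy.stats`.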
9 Discussion
Across multiple cases, we have demonstrated four key benefits of amortized inference in inverse modeling for HCI research:
• High predictive accuracy for individual users: Our reported levels of accuracy were on par with or better than those of non-amortized approaches (e.g., ABC).
• Computational efficiency: Fitting a model to an individual user’s data, which used to demand hours or days (with ABC), took only 10–125 ms for the same data (cases 1–2).
• Large-scale parameter inference: Thanks to its high efficiency, amortized inference aids in estimating how model parameters are distributed in a population and how they correlate with factors not accounted for in the model (e.g., age and gender in Case 3).
• A principled approach to uncertainty: As ABC does but unlike regular optimization-based approaches such as Nelder–Mead [52], inverse modeling with amortized inference affords estimating posterior distributions of model parameters. This holds value for HCI work with small-scale or noisy observations of users.
The following review of core aspects of the case studies’ results discusses our findings’ implications and what might affect performance.
Inference efficiency.
The complexity of the behavioral data behind inference (y) affects computational efficiency. In Case 1, y consists of summary features: 1-D data with only a feature dimension. The other cases add a trial dimension (since all data, from multiple trials, were used). Case 2 additionally involves a time dimension (because each trial’s data include a time-series trajectory), so it had the most complex y (3-D data, with trial, time, and feature dimensions). This necessitated a more complicated encoder network, which, in turn, increased the inference time. Nonetheless, inference time did not grow with the number of observed trials in y (Figure 13(a)), thanks to neural networks’ parallel computation (i.e., multiple trials’ computations can be processed as a batch).
Parameter recovery.
We discovered that parameter-recovery performance may differ not only between models but also among the parameters of a single model. While our trained density estimator achieved decent recovery performance for most parameters, a few showed lower \(R^2\) values (e.g., \(p_{sem}\) in Case 1 and \(n_v\) in Case 2). One influential factor is the identifiability of the model parameters. If multiple values of a parameter could produce the same model output, that parameter can be difficult to isolate from the given data. This is sometimes due to the model itself (e.g., model sloppiness [26]) or to small quantities of given data. One can put amortized inference to use for rapidly revealing gaps in the given data by examining the convergence of parameter-recovery performance with increasing dataset sizes. For example, from Figure 13(b) we can identify the number of observed trials at which performance starts to converge; in Case 3, we need at least 16 trials to recover k with nearly optimal accuracy and 128 trials for α. The approximation gap and amortization gap are two additional elements that could influence parameter-recovery performance [12]. The approximation gap refers to error resulting from limits to the density estimator’s representational power; that is, it results from approximating the true posterior distribution with a surrogate (parameterized) distribution. The amortization gap refers to error caused by training a single density estimator for the entire dataset, as opposed to optimizing an approximate distribution for each individual user’s data.
Individual-level fitting.
Amortized inference fitted the parameters better at the level of individuals than at that of the whole user group. Across all three cases, parameters fitted at individuals’ level showed the greatest accuracy. That said, there were exceptions (e.g., individual-level fitting was not superior to group-level fitting for backspacing in Case 3). This might be a result of constraints in the behavior space that the simulation model can express; for example, some behavioral features reproduced by the model may display inherent mutual dependencies. If the constraints are not aligned with real users’ behavior space, it will be impossible to replicate their behavior features in full. Then, one must find the closest possible point instead.
Scaling up the number of users.
We were able to infer model parameters for 1,057 individuals by means of amortized inference, whereas the previous highest number was 211, with ABC [18]. The inference process for all users took only 20 seconds in total. This level of efficiency is remarkable and signals that the method scales up well. It affords analysis that captures even small effects in datasets and connects them with parameters of a simulation model. The previously unreported difference we revealed between male and female users’ finger-movement motor resources (k) is a case in point (Figure 14).
Robustness to distributional shifts.
The density estimator showed some robustness to distribution shifts between training and test data, especially in dealing with data from unseen task conditions (e.g., the pointing target’s speed in Case 2) or different priors. Still, to achieve better inference performance against empirical data, one needs to hone the priors from the literature and make the training data as comprehensive as possible. Cases of other types can inform testing of density-estimator robustness. Questions remain with regard to, for instance, training data generated by mechanisms that do not perfectly match real-world data.
Point estimator vs. density estimator.
Our findings suggest that the two estimator types manifest a tradeoff relationship. A point estimator can further reduce the computation cost of inference, yet a density estimator might be more beneficial in several cases: 1) rigorously checking the reliability of the inferred values, 2) identifying multiple plausible explanations for observations, and 3) using the inferred posterior as a new prior to perform multi-round inference that is expected to achieve higher accuracy (e.g., in sequential neural posterior estimation [23, 57]).
9.1 Applications
Together, the robustness and low computation cost brought by amortized inference offer an exciting vista for future application of simulation models in 1) adaptive user interfaces, 2) recommendation systems, 3) diagnostic tools for user modeling, and 4) large-scale behavior analysis tools.
For the first of these, parameters inferred for a user could form the basis for adapting an interface optimally for the individual. For example, simulation models aid in optimizing a keyboard layout for people with such impairments as dyslexia or tremor [68]. Adaptation of this sort via simulation models has been handled mainly on a non-real-time basis thus far [68, 73, 74]. Our approach could open the door to developing adaptive user interfaces in a new way, enabling repetitive processes of inference from observations and optimization in real time. Secondly, amortized inference could advance interactive recommendation systems through inferring a user’s temporal intention and interest. For instance, future simulation models could assist in better disentangling various underpinnings of click data (individuals’ preference, intention, curiosity, etc.). As for developing user models with theory-based mechanisms, parameter inference enables a proposed model to be evaluated with actual user data, and amortized inference can speed up the investigation. Also, parameter identifiability is often of interest to modelers; the shape of the inferred posterior distribution informs how well each parameter can be distinguished via what is observable. With this approach, one can ensure the validity of datasets in future experiments and identify the number of observations required for valid inferences. Finally, low-cost inverse modeling enables applying simulation models to identify humans’ latent capabilities from datasets orders of magnitude larger than ever feasible before. For example, by inversely modeling an individual’s behavior, one could recover the attributes of the user’s motor and cognitive systems – muscle activation, memory recall, and many more.