1 Introduction

Math learners can now turn to a wide variety of freely available online resources, from Khan Academy to Massive Open Online Courses (MOOCs). However, many of these resources cannot completely reproduce features of in-person tutoring, like the sense of being engaged in a back-and-forth exchange with a tutor, tailored feedback, and guidance about how to allocate attention between reading explanations and practicing problems. Existing online math platforms have recently moved towards these desiderata with features like personalized feedback and guidance. For example, online math homework tools like ASSISTments (Heffernan & Heffernan, 2014; Feng et al., 2009) give feedback on common wrong answers. Further, online resources like MathTutor (Aleven et al., 2009a) build on example-tracing tutors (Aleven et al., 2016), which model the progression of a lesson with a behavior graph that: (1) outlines potential student actions, such as providing common incorrect responses; and (2) specifies the feedback, explanation, or new problem that should follow those actions. That approach aims to reduce development time while achieving some of the benefits of intelligent tutoring systems for mathematics, like personalized selection of problems (Nye et al., 2018; Falmagne et al., 2013; Craig et al., 2013; Winkler et al., 2020).

One consequence of online math education shifting from static media to adaptive intelligent tutoring systems is the dramatic increase in potential for personalization of the platform. When developing interactive platforms, a content designer must choose an appropriate pedagogical strategy: for example, whether the topic should be conveyed through conceptual lessons or practice problems, and the degree to which feedback should be provided. To choose an optimal pedagogical strategy for every new piece of content, one could turn to cognitive and educational experts and draw from educational theory, such as aptitude treatment interaction (Snow, 1989). In practice, however, it can be difficult to operationalize such theories to create effective strategies (Zhou et al., 2017). Furthermore, the large number of avenues for personalization, some of which may not have been previously investigated in the literature, along with the data available in online platforms suggests a more computational approach for learning personalized pedagogical policies.

The traditional method to compare the efficacy of various policies is to run a randomized A/B experiment. However, running such an experiment may not be feasible or desirable in adaptive education platforms due to high exploration costs: many users may be assigned to a bad pedagogical policy before the experiment is over, leading to deleterious effects on their learning experience. An alternative to traditional randomized experiments is the contextual bandit, a popular technique from the reinforcement learning (RL) literature (Li et al., 2010). Compared to traditional A/B tests, bandit algorithms can often learn personalized strategies with substantially less experimentation, leading to improved user experiences.

Our paper builds upon the aforementioned intelligent tutoring systems, moving from adaptive platforms to an actual conversational interface that closely mimics some key facets of conversation with a human tutor. Specifically, we designed and evaluated a prototype chatbot system, which we call MathBot. To achieve conversational flow and mirror the experience of interacting with a human tutor, we paid close attention to the timing of prompts and also incorporated informal language and emoji. As with a human tutor, the MathBot system alternates between presenting material and gauging comprehension. MathBot also provides learners with personalized feedback and guidance in the form of explanations, hints, and clarifying sub-problems. Finally, we built into MathBot the capability of learning personalized pedagogical policies via both contextual bandits and randomized experiments, allowing us to compare the two strategies in a live deployment.

To evaluate MathBot, we carried out three user studies on Amazon Mechanical Turk. The first study sought to determine whether users preferred to use MathBot over comparable online resources and, through qualitative feedback from users, elucidate potential avenues for improving MathBot through personalization. At a high level, we found that users were polarized in their preferences, with about half preferring MathBot. Specifically, 116 participants completed (in a randomized order) both an abridged lesson about arithmetic sequences with MathBot and a video on Khan Academy covering similar content; these participants then rated their experiences. We found that 42% of users preferred learning with MathBot over the video, with 20% indicating a strong preference. An additional 110 participants completed the same abridged lesson with MathBot along with a written tutorial from Khan Academy containing embedded practice problems. In this case, 47% of these users preferred learning with MathBot over the written tutorial, with 18% stating a strong preference. While MathBot was not preferred by the majority of our participants, our results point to potential demand for conversational agents among a substantial fraction of learners.

The second study sought to determine whether MathBot produced learning gains on par with comparable online resources. We randomized 369 participants to either complete a full-length conversation with MathBot about arithmetic sequences or complete a set of videos and written tutorials from Khan Academy covering similar content. To test their knowledge, each subject took an identical quiz before and after completing their assigned learning module. Under both conditions, participants exhibited comparable average learning gains and learning times: 65% improvement for MathBot, with a mean learning time of 28 min (SD = 20), and 60% improvement given Khan Academy material, with a mean learning time of 29 min (SD = 22); we note that the difference in learning gain was not statistically significant.

Given that a subset of users indeed preferred MathBot to conventional learning tools, we explored the potential of contextual bandits to learn personalized pedagogical policies for MathBot in the third and main study. For this experiment, we recruited 405 participants to complete a full-length conversation with MathBot about arithmetic sequences. Unlike the first two studies, in which the possible conversation paths were the same for each user, the third study leveraged a version of MathBot that could choose, for each user, whether or not to present certain conceptual lessons and also whether or not to provide certain supplemental practice questions. We randomized participants between two experimentation strategies (one in which actions were chosen by a contextual bandit and another in which actions were randomly chosen, i.e., an A/B design), with the ultimate goal of reducing learning time without reducing learning gains. This goal was motivated by feedback from users in the first two studies, some of whom commented that the pacing of the lesson was too slow and others that it was too fast, suggesting that personalizing the speed of the lesson could be beneficial. We found that, during experimentation, users assigned to the contextual bandit condition took less time (mean difference of 149 seconds, with a 95% confidence interval of [32, 266]) to complete the lesson and were less likely to drop out, while scoring comparably on the post-learning assessment to those assigned to the A/B design condition. Finally, we compared the quality of the learned post-experimentation policies using offline policy evaluation techniques, finding no statistically significant difference between the quality of the policies learned by the contextual bandit and the randomized experiment.

In summary, our contributions are threefold: (1) MathBot, a prototype system that adds conversational interaction to learning mathematics through solving problems and receiving explanations; (2) a live deployment of a contextual bandit in a conversational educational system; (3) evidence that a contextual bandit can continuously personalize an educational conversational agent at a lower cost than a traditional A/B design.

2 Related work

We briefly review past work on building chatbots, conversational tutoring systems, example-tracing tutors, and other intelligent tutoring systems (ITSs). We also survey the use of reinforcement learning algorithms in these systems.

2.1 Chatbots

Chatbots have been widely applied to various domains, such as customer service (Xu et al., 2017), college management (Bala et al., 2017), and purchase recommendation (Horzyk et al., 2009). One approach to building a chatbot is to construct rule-based input-to-output mappings (Al-Rfou et al., 2016; Yan et al., 2016). One can also embed chatbot dialogue into a higher-level structure (Bobrow & Winograd, 1977) to keep track of the current state of the conversation, move fluidly between topics, and collect context for later use (Walker & Whittaker, 1990; Seneff, 1992; Chu-Carroll & Brown, 1997). We envisioned MathBot as having an explicit, predefined goal of the conversation along with clear guidance and control of intermediate steps, so we took the approach of modeling the conversation as a finite-state machine (Raux & Eskenazi, 2009; Quarteroni & Manandhar, 2007; Andrews et al., 2006), where user responses update the conversation state according to a preset transition graph.

2.2 Conversational tutors in education

Conversational tutors in education often build complex dialogues. For example, one might ask students to write qualitative explanations of concepts (e.g., "A battery is connected to a bulb by two wires. The bulb lights. Why?") and initiate discussions based on the responses (Graesser et al., 2001). AutoTutor and its derivatives (Nye et al., 2014; VanLehn et al., 2002; Graesser et al., 1999, 2004) arose from Graesser et al.'s (1995) investigation of human tutoring behaviors and modeled the common approach of helping students improve their answers by way of a conversation. These systems rely on natural language processing (NLP) techniques, such as regular expressions, templates, semantic composition (VanLehn et al., 2002), LSA (Graesser et al., 1999; Person, 2003), and other semantic analysis tools (Graesser et al., 2007). Nye et al. (2018) added conversational routines to the online mathematics ITS ALEKS by attaching mini-dialogues to individual problems but left navigation to be done via a website. MathBot aims to have the entire learning experience take place through a text conversation, giving the impression of a single tutor. More broadly, MathBot differs from past work on NLP-based conversational tutors in that it explores the possibility of reproducing part of the conversational experience without handling extensive open-ended dialogue, potentially reducing development time.

2.3 Intelligent tutoring systems and example-tracing tutors

A wide range of intelligent tutoring systems in mathematics use precise models of students’ mathematical knowledge and misunderstandings (Ritter et al., 2007; VanLehn, 1996; Aleven et al., 2009a, b; O’Rourke et al., 2015). To reduce the time and expertise needed to build ITSs, some researchers have proposed example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016). Specifically, example-tracing tutors allow content designers to specify the feedback that should appear after students provide certain answers and then record those action-feedback pairs in a behavior graph (Aleven et al., 2016). Using the Cognitive Tutor Authoring Tools (CTAT), Aleven et al. (2009a, b) built MathTutor, a suite of example-tracing tutors for teaching 6th, 7th, and 8th grade math. Our work draws on insights from example-tracing tutors in that we build a graph which encodes rules that determine how MathBot responds to specific student answers, though our approach differs in that we display these responses in a conversational format.

2.4 Learning pedagogical strategies with bandits

To allow MathBot to personalize elements during live deployment, we incorporate a contextual multi-armed bandit algorithm (Lai & Robbins, 1985; Li et al., 2010), a tool from reinforcement learning for discovering which actions are effective in different situations (contexts). Other reinforcement learning approaches have been applied in education, typically for offline learning. Ruan et al. (2019) increase student performance by combining adaptive question sequencing with an NLP-based conversational tutor for teaching factual knowledge, but use a combination of random selection and a probabilistic model of learners' knowledge of particular items to order questions. Lee et al. (2014) describe a framework to learn personalized pedagogical policies for DragonBox Adaptive, a K–12 math puzzle platform, without the support of an expertly-designed cognitive model. Chi et al. (2011) use another popular technique from RL to learn an effective pedagogical strategy for making micro-decisions, such as eliciting the next step of the problem versus revealing it, in an NLP-based ITS teaching college-level physics. Lan and Baraniuk (2016) describe a contextual bandit framework to assign students to an educational format and optimize performance on an immediate follow-up assessment, but evaluate the performance of the framework offline and do not personalize the actual lessons. A key difference between these studies and our work is that such strategies are rarely used online in a live educational deployment. Only a handful of studies have begun to explore live deployments for sequencing problems (Clement et al., 2015; Segal et al., 2018), and none that we are aware of do so to learn which actions to take in a conversation.

3 MathBot system design and development

MathBot allows users to learn math topics through conversation-style interaction, rather than simply browsing online resources like videos, written lessons, and problems. Below we give an illustrative example of a learner interacting with MathBot, describe MathBot’s front-end of an interactive chat, and outline its back-end of a conversation graph which specifies the rules by which it progresses through concepts and chooses actions to take based on user responses.

3.1 Sample learner interaction with MathBot

Suppose a student, Alice, wants to learn about arithmetic sequences by interacting with MathBot. To start the interaction, MathBot greets Alice and asks her to extend the basic sequence “2, 4, 6, 8 \(\ldots\)”. Alice correctly answers “10”, so MathBot provides positive feedback (e.g., “Good work! ”) and begins a conceptual explanation of recognizing patterns in sequences. MathBot asks Alice if she is ready to complete a question to check her understanding, and Alice responds affirmatively. Alice progresses successfully through a series of additional explanations and questions.

Following an explanation of common differences, Alice is asked a new question: “What’s the common difference of 2, 8, 14, 20, \(\ldots\)?”. Figure 1 displays the conversation rules that underlie Alice’s current question. When asked the new question, Alice confuses the term “common difference” with “greatest common factor”, a topic she recently reviewed, so she answers “2”. MathBot recognizes that Alice has made a mistake and subsequently checks that she knows how to identify terms in a sequence and subtract them, a prerequisite task for finding the common difference (Fig. 1ii). Alice answers correctly, so MathBot begins to ask her a series of additional sub-questions to further clarify the concept of common differences (Fig. 1iii). Alice successfully completes these sub-questions, so MathBot directs her back to the original question. Alice remembers learning that the common difference is the difference between consecutive terms, though she mistakenly subtracts 8 from 2 and answers “I think it’s −6”. Rather than have Alice finish a redundant series of sub-questions, MathBot recognizes that Alice has made a common mistake, subsequently provides specific feedback to address that mistake, and then allows Alice to retry the original question (Fig. 1iv). Alice answers the original question correctly and proceeds to a new question on identifying decreasing arithmetic sequences (Fig. 1v).

Fig. 1 Example section of MathBot's conversation graph. Ellipses (\(\ldots\)) denote excised sections of the full conversation graph. Marked blocks (i)–(v) denote actions taken by a hypothetical user, Alice, in Sect. 3.1

3.2 MathBot’s front-end chat and back-end conversation graph

The front-end of MathBot is a text chat window between MathBot and the student (Fig. 2a, b). Students type replies to MathBot, answering problems or expressing uncertainty with responses like "I'm not sure". Students can freely scroll through the chat history to review explanations or questions.

Drawing inspiration from example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016), we built the MathBot back-end as a conversation graph that specifies a set of if-then rules for how learner input (e.g., "I'm ready" or "The answer is 6") leads to MathBot's next action (e.g., give a new problem or provide feedback). In this rule-based system, the state of the conversation is represented as a finite state machine (FSM). In this FSM, each state is a response provided by MathBot, and user responses route the user along different paths in the conversation graph. For example, the question asked at the top of Fig. 1 is a state, and responses to that question (e.g., "I don't know" or "6") route users to a new state. MathBot uses fuzzy matching and basic string equivalence to parse responses and route users appropriately.
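To make this structure concrete, the sketch below shows one way a conversation-graph node and its routing rules could be represented in Python. The state names, answer patterns, and fuzzy-matching threshold are illustrative assumptions rather than MathBot's actual implementation, and a real system would also extract numeric answers from free-text replies.

from difflib import SequenceMatcher

def fuzzy_match(response: str, target: str, threshold: float = 0.85) -> bool:
    # Approximate string comparison, standing in for MathBot's fuzzy matcher.
    a, b = response.lower().strip(), target.lower().strip()
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

# Each state holds the bot's message and ordered routing rules keyed on user input;
# the first matching rule determines the next state.
CONVERSATION_GRAPH = {
    "ask_common_difference": {
        "message": "What's the common difference of 2, 8, 14, 20, ...?",
        "rules": [
            ("6", "praise_and_continue"),            # correct answer
            ("-6", "address_reversed_subtraction"),  # common mistake: 2 - 8
            ("I don't know", "check_prerequisites"),
        ],
        "default": "check_prerequisites",            # unrecognized answers trigger sub-questions
    },
    # ... remaining states excised ...
}

def next_state(current_state: str, user_response: str) -> str:
    # Route the user along the finite-state machine given their reply.
    node = CONVERSATION_GRAPH[current_state]
    for pattern, destination in node["rules"]:
        if fuzzy_match(user_response, pattern):
            return destination
    return node["default"]

Ordering the rules so that specific misconception patterns are checked before the default mirrors how the conversation graph in Fig. 1 routes common wrong answers to targeted feedback rather than to the generic sub-question sequence.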

Fig. 2 Example snippets of MathBot conversations

4 Evaluating MathBot

We first validate MathBot in two studies comparing it to Khan Academy, a high-quality, free, and widely-used online resource for math tutorials and problems that delivers content in a non-conversational format. In the first study, we investigate user preferences between the two platforms and solicit qualitative feedback on what users liked and disliked about MathBot. In the second study, we compare the learning efficacy of the two platforms. In the third and main study, we leverage qualitative feedback from the first two studies to design personalized improvements to MathBot’s pedagogical policy.

4.1 Design of Study 1

In the first part of this within-subject study, we ask participants on Amazon Mechanical Turk to interact with MathBot and watch a 6-min Khan Academy video, and then solicit feedback on the two learning methods (Fig. 3). Despite their lack of interactivity, Khan Academy videos are competitive baselines, as they are carefully tailored by expert instructors and are demonstrably effective for teaching mathematical content (Weeraratne & Chin, 2018).

Fig. 3 Study design of the first part of Study 1, which measured preferences for video-based instruction versus instruction via MathBot

We conduct the second part of the study identically, except we recruit new users and replace the video with a written tutorial from Khan Academy containing embedded practice problems (Fig. 4). This second comparison provides an additional layer of insight, as one might conjecture that any result favoring MathBot over video instruction may simply be the result of MathBot providing an interface to work through problems.

Fig. 4 Study design of the second part of Study 1, which measured preferences for instruction via written tutorial versus instruction via MathBot

To limit the length of the study, we use an abridged version of our developed MathBot content that covers only explicit formulas for arithmetic sequences, and pair that with either a Khan Academy video or a written tutorial that covers similar material. To avoid ordering effects, including anchoring bias and fatigue, we randomized the order in which participants saw MathBot and the Khan Academy video or written tutorial. Tables 2 and 3 in the "Appendix" summarize user attrition and filtering, which were similar across conditions (see Footnote 1). After accounting for user attrition and the filtering criteria, 116 participants remained in the first part of the study and 111 participants in the second part. Tables 4 and 5 in the "Appendix" summarize the demographics of the filtered set of users. Our analysis is restricted to this filtered set of users.

4.2 Quantitative results

After study participants completed the MathBot and Khan Academy learning modules, we asked them a series of questions to quantify their experiences. In particular, we asked participants to answer the following question on a 7-point scale ranging from “strongly prefer” MathBot to “strongly prefer” the Khan Academy material: “If you had 60 min to learn more about arithmetic sequences and then take a quiz for a large bonus payment, which of these two scenarios would you prefer? 1. Interact with an expanded version of the conversational computer program, then take the quiz. 2. [Watch more videos / Complete more interactive tutorials] about arithmetic sequences, then take the quiz.” We note that the ordering of options 1 and 2 was randomized for each user.

The responses to this question for the first part of the study are presented in Fig. 5a. We found that 42% of participants stated at least a weak preference for MathBot, 53% stated at least a weak preference for Khan Academy videos, and 5% indicated a neutral preference. The corresponding results for the second part of the study are displayed in Fig. 5b. In that case, we found that 47% of the 110 participants who answered the question stated at least a weak preference for MathBot, 44% stated at least a weak preference for Khan Academy interactive tutorials, and 9% stated a neutral preference. Tables 6, 7, 8, and 9 in the "Appendix" summarize the experiential ratings and time-on-task of participants in Study 1.

Overall, more of our participants preferred Khan Academy materials to MathBot—a testament to the quality of Khan Academy. The highly polarized response distribution, however, also illustrates the promise of new forms of instruction to address heterogeneous learning preferences. Indeed, 20% of users in the first part of the study and 18% of users in the second part expressed a “strong preference” for MathBot over Khan Academy material.

Fig. 5 Distributions of user preferences among the participants of Study 1. "M" denotes MathBot, "V" denotes video, and "T" denotes tutorial. Each "+" indicates a stronger preference, and "\(\sim\)" indicates a neutral choice. Preferences for MathBot and Khan Academy are highly polarized, suggesting that the needs of learners could be better met by offering both modes of instruction

4.3 Qualitative results

After each part of the study, we asked users to respond to the following prompt: “Please compare your experience with the conversational computer program and the [video / interactive tutorial]. In what scenarios could one learning method be more effective or less effective than the other?” We analyzed the resulting comments to identify themes and understand users’ perspectives on MathBot and the Khan Academy videos and written tutorials. One author conducted open coding to identify common themes addressed by each response. Another author verified the coded labels and resolved conflicts with discussion. We discuss the coded categories at length in the Appendix, but highlight one theme in particular, that of pacing, here. We found that different users expressed different sentiments about the pacing of the lessons. For example, one participant noted, “as it gets more complicated, the lesson should slow down a bit,” while another indicated, “I felt like the teaching went too slow for me.” We return to this feedback later on, seeking to address it via personalization, slowing down or speeding up the conversation for each learner as appropriate.

4.4 Design of Study 2

We next sought to evaluate whether MathBot produced comparable learning outcomes to Khan Academy material. To assess educational gains, we randomly assigned participants to learn about arithmetic sequences via: (1) a full-length MathBot conversation; or (2) a combination of Khan Academy videos and written tutorials covering the same content as the MathBot conversation (Fig. 6). We assessed learning outcomes with a 12-question quiz, giving the same quiz before and after each participant completed the learning module (see Footnote 2). Similar filtering criteria to Study 1 resulted in our analyzing 182 subjects assigned to MathBot and 187 assigned to Khan Academy materials. Table 10 in the "Appendix" summarizes user attrition and filtering in Study 2, and Table 11 summarizes user demographics.

Fig. 6 Experimental design of Study 2, which measured learning gains achieved by instruction via MathBot versus instruction via Khan Academy videos and written tutorials

4.5 Results

We start by computing the proportional learning gain (PLG) for each subject. To calculate PLG, we first determine the raw learning gain by subtracting the pre-learning quiz score from the post-learning quiz score. We divide this result by the maximum possible score increase, defined as the difference between the maximum possible post-learning score (12) and the user's pre-learning score. Figure 7 shows the distribution of the PLG. We find the average PLG for MathBot users is 65%, with a 95% confidence interval of [58%, 72%]; the corresponding average PLG for Khan Academy users is 60%, with a 95% confidence interval of [53%, 67%]. The gains from MathBot are slightly higher than those from Khan Academy, but the difference is not statistically significant (two-sample t-test, p = 0.15, 95% CI: [-2%, 12%]). MathBot and Khan Academy users spent comparable time completing the learning modules: 28 min on average for MathBot (SD = 20) and 29 min for the Khan Academy videos and written tutorials (SD = 22). Table 12 in the "Appendix" summarizes raw learning outcomes of participants in Study 2, and Table 13 in the "Appendix" summarizes performance on individual questions in the pre- and post-learning assessments.
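Written as a formula, with pre and post denoting a user's pre- and post-learning quiz scores out of a maximum of 12, the quantity computed above is:

$$\begin{aligned} \text{PLG} = \frac{\text{post} - \text{pre}}{12 - \text{pre}}. \end{aligned}$$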

Fig. 7 Distributions of proportional learning gain (PLG) for users of MathBot and Khan Academy in Study 2. The distributions are similar for users in both conditions

5 Learning a pedagogical policy

Here we return to feedback from users in Study 1 who expressed mixed sentiments about the pacing of MathBot and address their concerns by learning a personalized pedagogical policy for pacing. Given that the MathBot conversation is structured as a series of lessons, each consisting of a conceptual explanation followed by an assessment question, we could potentially adjust the pacing of a lesson in one of four ways: (1) show the conceptual explanation and show an isomorphic practice question before the assessment question (slowest); (2) show the conceptual explanation but skip the isomorphic practice question; (3) skip the conceptual explanation but show the isomorphic practice question; and (4) skip the conceptual explanation and skip the isomorphic practice question (fastest). Figure 8 illustrates these four actions.

We took a data-driven approach to learning a personalized pedagogical strategy that selects between these four actions for each user and question. We specifically chose to use a contextual bandit, a tool from the reinforcement learning literature which balances exploring actions whose payoffs are unclear with exploiting actions whose payoffs are believed to be high (Li et al., 2010). For each user and question, the bandit selects one of the four above actions based on the user's pre-learning quiz score (the context). For example, the algorithm might learn to speed up the conversation for users with high pre-learning quiz scores and slow it down for those with low scores. We note that we had access to many more contextual features than the pre-learning quiz score, such as scores on individual quiz items and self-reported academic history of study participants. However, to best mimic a real-life learning scenario where a tutor has access to only a coarse measure of prior knowledge, such as a grade in a prior course, we chose to use the pre-learning quiz score as the sole context.

Fig. 8 Potential actions taken by the contextual bandit before each assessment question. The bandit chooses whether or not to show a conceptual explanation and whether or not to show an isomorphic practice question

To train a contextual bandit, we must specify not only the actions but also the objective function (the reward) that the algorithm will optimize (see Footnote 3). Recall that our motivation for improving our pedagogical strategy was to personalize the pacing of the lesson, with the goal of either slowing down the chat to boost comprehension or speeding it up without sacrificing learning. These dual desiderata suggest defining our reward as a linear combination of the total time spent on a lesson and an indicator of whether the user gets the assessment question correct on their first try:

$$\begin{aligned} 150 \cdot \mathbf{1}_{\text{correct}} - \text{seconds spent on lesson}. \end{aligned}$$

In other words, we assume it is worth 150 seconds of extra time spent on a lesson to turn a student who would have answered the assessment question incorrectly into a student who answered the question correctly. In particular, we expected the lesson to take around 30 min for 12 concepts, giving 2.5 min (or 150 seconds) per concept. It bears emphasis that the precise form of the reward function should be set by domain experts and depends on the situation. For example, in a setting where a chatbot was augmented by a human tutor, we might increase the relative worth of time compared to correctness to account for the opportunity cost of having the concept explained by the tutor. Finally, we note that our reward is defined at the level of an individual lesson: later, we consider whether the contextual bandit's strategy is also optimizing a global reward defined at the level of the entire learning session.
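To make the setup concrete, the following is a minimal sketch of how such a bandit could operate, using Thompson sampling with an independent Bayesian linear reward model for each (question, action) pair and the pre-learning quiz score as the only feature. The feature set, priors, and noise variance are illustrative assumptions, and the algorithm actually deployed in MathBot may differ.

import numpy as np

ACTIONS = ["concept+isomorph", "concept_only", "isomorph_only", "neither"]

def local_reward(correct_first_try: bool, seconds_on_lesson: float) -> float:
    # Per-lesson reward: 150 points for a first-try correct answer, minus time spent.
    return 150.0 * float(correct_first_try) - seconds_on_lesson

class PerQuestionBandit:
    # Thompson sampling with a Bayesian linear model of reward for every
    # (question, action) pair; the context vector is [1, pre-learning quiz score].
    def __init__(self, n_questions: int, prior_var: float = 100.0, noise_var: float = 2500.0):
        d = 2
        self.A = {(q, a): np.eye(d) / prior_var  # posterior precision matrices
                  for q in range(n_questions) for a in ACTIONS}
        self.b = {(q, a): np.zeros(d)            # accumulated X^T y / noise_var
                  for q in range(n_questions) for a in ACTIONS}
        self.noise_var = noise_var

    def choose(self, question: int, pre_quiz_score: float) -> str:
        x = np.array([1.0, pre_quiz_score])
        best_action, best_draw = ACTIONS[0], -np.inf
        for a in ACTIONS:
            cov = np.linalg.inv(self.A[(question, a)])
            mean = cov @ self.b[(question, a)]
            draw = np.random.multivariate_normal(mean, cov) @ x  # posterior sample of reward
            if draw > best_draw:
                best_action, best_draw = a, draw
        return best_action

    def update(self, question: int, action: str, pre_quiz_score: float, reward: float):
        x = np.array([1.0, pre_quiz_score])
        self.A[(question, action)] += np.outer(x, x) / self.noise_var
        self.b[(question, action)] += reward * x / self.noise_var

Under this sketch, choose would be called before each lesson to pick among the four actions in Fig. 8, and update would be called once the learner's first-try correctness and time on the lesson are observed.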

5.1 Design of Study 3

Our goal is to assess the value of using a contextual bandit to learn a personalized pedagogical strategy for students. We benchmark the bandit against a common alternative: a regression fit on data from users who were randomly assigned to one of the four possible actions before each assessment question. That is, in the benchmark approach, we first conduct an exploration phase, in which we assign users to the four actions uniformly at random; then, we fit a regression on the collected data to learn a personalized policy. The bandit, in contrast, aims to better manage exploration by down-weighting actions that are learned to be ineffective.
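As a concrete illustration of this benchmark, one could regress observed per-lesson rewards on question-by-action indicators plus the pre-learning quiz score using the uniformly collected data, and then define the learned policy as the predicted-reward-maximizing action for each question and score. The sketch below assumes hypothetical column names and a simple least-squares specification; the regression models actually considered are described in Sect. 5.2.2.

import numpy as np
import pandas as pd

ACTIONS = ["concept+isomorph", "concept_only", "isomorph_only", "neither"]

def fit_benchmark_policy(logs: pd.DataFrame):
    # Fit reward ~ (question x action indicators) + pre_quiz_score on uniform random data.
    # `logs` is assumed to have hypothetical columns:
    # question (int), action (str), pre_quiz_score (float), reward (float).
    dummies = pd.get_dummies(logs["question"].astype(str) + ":" + logs["action"])
    design = np.hstack([dummies.to_numpy(dtype=float),
                        logs[["pre_quiz_score"]].to_numpy(dtype=float)])
    coefs, *_ = np.linalg.lstsq(design, logs["reward"].to_numpy(dtype=float), rcond=None)
    columns = list(dummies.columns)

    def policy(question: int, pre_quiz_score: float) -> str:
        # Predict the reward of each action for this (question, score) and pick the best.
        preds = []
        for a in ACTIONS:
            x = np.zeros(len(columns) + 1)
            key = f"{question}:{a}"
            if key in columns:
                x[columns.index(key)] = 1.0
            x[-1] = pre_quiz_score
            preds.append(float(x @ coefs))
        return ACTIONS[int(np.argmax(preds))]

    return policy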

To carry out this comparison, we first recruited 30 participants from Amazon Mechanical Turk and assigned them to each of the four actions at random, independently for each question. Data from this pilot phase were used to provide the bandit a warm start. We then randomly assigned the remaining participants to either: (1) the contextual bandit condition; or (2) the uniform random condition (Fig. 9).

Fig. 9 Experimental design of Study 3, which investigated whether a contextual bandit could learn a personalized pedagogical policy for MathBot at a lower cost than a randomized A/B design

We use the same criteria as in Study 2 to filter participants before they interact with MathBot. These filtering criteria resulted in 228 subjects assigned to MathBot with a uniform random policy and 239 assigned to MathBot with a contextual bandit policy. We note that both groups include the 30 participants from the pilot phase: they are included in the uniform random group since their actions were given uniformly at random, and they are included in the bandit group as the bandit learned its initial policy from those individuals. Table 14 in the “Appendix” summarizes user dropout during experimentation, and Table 15 summarizes learning outcomes.

5.2 Results

We examine the behavior of the contextual bandit algorithm along three dimensions: (1) its degree of personalization; (2) the quality of the final learned pedagogical policy; and (3) the cost of exploration. We found that the bandit learned a personalized policy comparable in quality to the one learned on the uniform random data but, importantly, did so while imposing less burden on users.

5.2.1 Personalization

We begin by examining the pedagogical policy ultimately learned by the contextual bandit (i.e., the policy the bandit believed to be the best at the end of the experiment, after seeing 239 participants). Averaged over all questions, the final, learned policy assigns approximately 30% of users to each of the concept-only, isomorph-only, and no-concept-no-isomorph conditions; the remaining 10% are assigned to the concept-plus-isomorph condition (see Footnote 4). In Fig. 10, we disaggregate the action distribution by question, showing the result for 4 representative questions out of the 11 total. The plot shows that the bandit is indeed learning a policy that differs substantially across users and questions. For only 3 of the 11 questions does the bandit determine that it is best to use the same action for every user; even in these cases, the selected action differs across those 3 questions.

Fig. 10 For four representative assessment questions out of the eleven total, the proportion of users for which the final policy learned by the bandit would use each action. The policy chooses different actions based on each user's pre-learning quiz score

5.2.2 Quality of learned solution

Next, we compare the expected reward of the learned policy from the bandit to that of the learned policy from the uniform random condition (see Footnote 5). For the uniform random condition, we consider three different regression models: (1) a model with two-way interactions between actions and questions (effectively learning a constant policy per question); (2) a model with the same specification as the bandit, which is able to personalize based on pre-learning quiz score; and (3) a lasso regression that includes eight contextual covariates: pre-learning quiz score, accuracy on the previous question, time since starting the learning session, whether in the previous concept they were shown the conceptual explanation and/or isomorphic practice question, the time they spent on the previous concept, and the speed at which they set MathBot to send responses. Of these three models, the first performs the best.

We find that the bandit learned a policy that is comparable to the most successful policy learned from the uniform random condition, and further, that both the bandit and uniform random strategies learned a policy that outperformed the original policy from Studies 1 and 2 of always showing the concept without an isomorphic practice question. In Table 1, we display the average expected rewards of the two learned policies, along with the four policies which use the same action constantly. In particular, we find no statistically significant difference between the average reward obtained by the final bandit policy and the policy learned from the uniform random data. A 95% confidence interval for this difference in average rewards is [−11, 28], slightly in favor of the policy learned in the uniform random condition. Finally, we note one additional advantage of the contextual bandit: it can continually refine its learned solution as more users arrive, whereas the pedagogical policy would traditionally be fixed once the uniform random experiment concludes.

Table 1 For Study 3, average expected reward per question with 95% confidence intervals for the final policy learned from the bandit and uniform random conditions
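As an illustration of how the value of a learned policy can be estimated without deploying it, the sketch below applies inverse propensity scoring to data logged under the uniform random condition, where each of the four actions was assigned with probability 1/4. This is one standard off-policy evaluation approach shown here for illustration only; it is not necessarily the estimator used to produce Table 1, and the data-frame columns are hypothetical.

import numpy as np
import pandas as pd

def ips_value(policy, logs: pd.DataFrame, logging_prob: float = 0.25) -> float:
    # Inverse-propensity-scoring estimate of a policy's average per-lesson reward,
    # computed from data logged under uniform random action assignment.
    # `logs` is assumed to have hypothetical columns: question, pre_quiz_score, action, reward.
    agrees = np.array([policy(q, s) == a for q, s, a in
                       zip(logs["question"], logs["pre_quiz_score"], logs["action"])])
    weights = agrees / logging_prob  # weight of 1/0.25 when the policy matches the logged action, else 0
    return float(np.mean(weights * logs["reward"].to_numpy(dtype=float)))

# Hypothetical usage, assuming `bandit_policy` and `benchmark_policy` map
# (question, pre_quiz_score) to one of the four actions:
# print(ips_value(bandit_policy, uniform_logs), ips_value(benchmark_policy, uniform_logs))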

5.2.3 The cost of exploration

The above results indicate that a contextual bandit can indeed learn a personalized pedagogical policy that is on par with one learned from uniform random data. The primary value of a bandit, however, is that it incurs lower costs of exploration by quickly learning which actions are unlikely to be beneficial. We thus now directly compare the average rewards obtained under the bandit and uniform random conditions during the model-learning period. Higher average reward during model-learning suggests users are having a better experience, as they receive sub-optimal actions less often.

We first compute the average reward for each lesson in both conditions, and then average that quantity over all the lessons for both conditions. This gives us the average reward per lesson per user for both conditions. As shown in Fig. 11 (left panel), the average reward in the contextual bandit condition is substantially higher than in the uniform random condition. A 95% confidence interval on the difference is [9.6, 29.1] (see Footnote 6). In Fig. 12 in the "Appendix" we plot the cumulative average over the total local rewards per user as a function of the number of users in each condition, finding that the bandit quickly improves upon the uniform random policy in terms of average reward.
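Restating this two-stage average as a formula, let \(r_{i\ell}\) be the reward of user \(i\) on lesson \(\ell\), \(n_\ell\) the number of users who completed lesson \(\ell\), and \(L\) the number of lessons; the quantity plotted for each condition is then:

$$\begin{aligned} \bar{R} = \frac{1}{L} \sum_{\ell=1}^{L} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} r_{i\ell}. \end{aligned}$$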

Fig. 11 Average local (left) and global (right) rewards during the experiment for the bandit and uniform random conditions with 2 standard errors (1 SE solid, 2 SE dashed). In both cases, the contextual bandit obtains higher rewards, suggesting that it provides improved user experience while learning an optimal pedagogical policy

As another way to assess the cost of exploration, we compute the average value of a global reward function across users in our two conditions—bandit and uniform random. Analogous to the local reward function, the global reward is defined as:

$$\begin{aligned} 150 \cdot \text{post-learning quiz score} - \text{seconds spent on MathBot}. \end{aligned}$$

In contrast to the local reward function, the global reward considers the total post-learning quiz score and total time spent on the entire MathBot conversation, rather than correctness and time spent during individual lessons.

Figure 11 (right panel) shows the average global rewards of participants between the two conditions. We find that the bandit obtains considerably higher average global rewards than the uniform random condition, with the difference being 171 (95% CI [18, 324], p = 0.029). We note further that the difference is mostly driven by users in the bandit condition taking far less time to finish the MathBot conversation. Table 15 breaks down the average learning gains and lesson times for users in the two conditions. The average difference in lesson times is 149 seconds (95% CI [32, 266], p = 0.012), translating to the bandit saving around 12% of the uniform policy's lesson time, while users in both conditions scored roughly the same on the post-learning quiz, with the difference being 0.2 questions (95% CI [−0.53, 0.93], p = 0.59), slightly in favor of the bandit. The bandit was only designed to optimize for local rewards, so this result offers further evidence that the bandit is learning a generally effective policy.

As a final way to assess user satisfaction during exploration, we examine the difference in dropout rates between the two conditions. A user is said to “drop out” if they complete the first MathBot lesson but not the final lesson, either skipping to the post-learning quiz or leaving the experiment. Out of the participants in the bandit condition, 9% dropped out, compared to 15% of participants in the uniform random condition—a statistically significant gap of 6 percentage points (two-proportion z-test, \(p<0.05\)). This result again suggests the bandit provides an improved user experience while learning a pedagogical policy.

6 Discussion

6.1 Limitations

One potential shortcoming of MathBot and similar conversational tutoring systems is the time needed to develop and test the underlying conversation graph. On the other hand, since MathBot does not require researchers to develop NLP algorithms and models for conversation, it has one of the strengths of example-tracing tutors: those without extensive machine-learning expertise, including high-school instructors, could feasibly participate in development. In addition, Study 3 demonstrated the successful use of contextual bandit algorithms to learn how to personalize elements of the conversation graph—specifically, when to skip conceptual explanations and when to give additional practice problems. This result provides one demonstration of how such components could be learned via a data-driven process after deployment, further minimizing the development time.

It is worth discussing whether our success in using a contextual bandit to learn a pedagogical policy might generalize to other learning scenarios. One of the main theoretical concerns of using a contextual bandit in learning scenarios is that it may not be able to optimally handle long-term dependencies (e.g., skipping the first conceptual explanation hurts performance on the eighth concept). Much work has thus explored more complex methods for learning personalized pedagogical strategies, which in turn require more data (Chi et al., 2011; Ruan et al., 2019). In particular, we point out two features of our setting that are encouraging in this respect: (1) the lesson contains many concepts, most of which build upon one another, and (2) our bandit, despite being designed with a local reward function, was still able to learn more effectively than a uniform random policy even when evaluated with a global reward function. These two points of evidence suggest that bandits, despite theoretical concerns, may still have value in learning pedagogical policies even in complex and path-dependent learning scenarios.

An important limitation of our study is that we evaluated MathBot using a convenience sample of adults from Amazon Mechanical Turk. While Mechanical Turk workers have been shown to exhibit similar video-watching behavior and quiz performance as MOOC learners (Davis et al., 2018), it would be valuable to test our system with a population actively exposed to algebra instruction, such as high school students or remedial adult learners in college. Our study also does not address the implications of using MathBot as a major component of a full-length course. For example, we did not investigate knowledge retention, and we do not know whether students would enjoy using MathBot less or more if they used it to learn over the course of several weeks or months. One potential upside of using MathBot over a longer period of time is that if a student's aptitude changed, MathBot would automatically adjust its lessons accordingly. Since MathBot's applicability to a classroom setting is yet to be explored, future work could consider how this approach would be received and used by teachers. For example, would MathBot be most useful as homework, as an optional supplementary resource, or as in-class practice?

Additionally, our system taught a single algebra topic, arithmetic sequences, with a conversation intended to last approximately 30 min (Studies 2 and 3) or under 10 min (Study 1). Furthermore, because Khan Academy is an independent platform, we were unable to closely examine the video-watching and tutorial-completion behavior of participants in Studies 1 and 2. Further investigation is necessary to understand exactly which of our insights might generalize to other learning scenarios, including longer interaction periods, different topics in mathematics, and different learning formats such as games (Lee et al., 2014).

6.2 Conclusion

In this work, we developed and studied the effect of an interactive math tutoring system: MathBot. Although the content of MathBot closely matched that of the Khan Academy materials, we found evidence of heterogeneous learning preferences. MathBot produced learning gains that were somewhat higher than those of Khan Academy, though the gap was not statistically significant. Finally, we found that a contextual bandit was able to efficiently learn a personalized pedagogical policy for showing extra practice problems and skipping explanations to appropriately alter the pace of the MathBot conversation, outperforming a randomized experiment.

We note several directions for further work. We found that the bandit was able to learn as effective a policy as the randomized A/B experiment while requiring less time; however, we did not study what might happen in a setting where the time of the lesson was fixed and the bandit instead had to optimize learning gains given the fixed time allotted for the lesson. Additionally, given the challenge of fully exploring a substantial number of actions and a sizable context space with only a limited number of interactions with real users, the contextual bandit used in Study 3 had access to only four actions and one contextual variable. If a future iteration of MathBot were released to a larger audience, the bandit could explore additional actions, such as entirely skipping topics or providing more than one additional practice question, and could leverage additional contextual variables, such as users' stated preferences for learning via conceptual explanations versus example problems, or individual pre-quiz answers. Furthermore, the choice of learning media itself could be personalized with either a contextual bandit or another technique from the reinforcement learning literature: one could certainly imagine specific students or concepts being better suited for conversation than video or vice-versa.

Several users in Study 1 noted the benefit of interacting with multiple learning modules, and past work has demonstrated that prompting users with relevant questions periodically during a video may improve learning outcomes (Shin et al., 2018). Accordingly, one could explore integrating brief conversations with MathBot into educational videos or, conversely, incorporating video elements into the MathBot conversation. Though MathBot interactively guides learners through explanations and relevant questions, it does not provide a platform for extensive rote practice after finishing the conversation. An adaptive question sequencing model such as DASH [29] could be used to guide students through an optimized sequence of practice problems by accounting for student performance during the MathBot conversation. We hope that future work will investigate the potential of intelligent tutoring systems that incorporate multiple modes of teaching and learn to personalize themselves to individual student needs.