1 Introduction

Math learners can now turn to a wide variety of freely available online resources, from Khan Academy to Massive Open Online Courses (MOOCs). However, many of these resources cannot completely reproduce features of in-person tutoring, like the sense of being engaged in a back-and-forth exchange with a tutor, tailored feedback, and guidance about how to allocate attention between reading explanations and practicing problems. Existing online math platforms have recently moved towards these desiderata with features like personalized feedback and guidance. For example, online math homework tools like ASSISTments (Heffernan & Heffernan, 2014; Feng et al., 2009) give feedback on common wrong answers. Further, online resources like MathTutor (Aleven et al., 2009a) build on example-tracing tutors (Aleven et al., 2016), which model the progression of a lesson with a behavior graph that: (1) outlines potential student actions, such as providing common incorrect responses; and (2) specifies the feedback, explanation, or new problem that should follow those actions. That approach aims to reduce development time while achieving some of the benefits of intelligent tutoring systems for mathematics, like personalized selection of problems (Nye et al., 2018; Falmagne et al., 2013; Craig et al., 2013; Winkler et al., 2020).

One consequence of online math education shifting from static media to adaptive intelligent tutoring systems is the dramatic increase in potential for personalization of the platform. When developing interactive platforms, a content designer must choose an appropriate pedagogical strategy: for example, whether the topic should be conveyed through conceptual lessons or practice problems, and the degree to which feedback should be provided. To choose an optimal pedagogical strategy for every new piece of content, one could turn to cognitive and educational experts and draw from educational theory, such as aptitude treatment interaction (Snow, 1989). In practice, however, it can be difficult to operationalize such theories to create effective strategies (Zhou et al., 2017). Furthermore, the large number of avenues for personalization, some of which may not have been previously investigated in the literature, along with the data available in online platforms suggests a more computational approach for learning personalized pedagogical policies.

The traditional method to compare the efficacy of various policies is to run a randomized A/B experiment. However, running such an experiment may not be feasible or desirable in adaptive education platforms due to high exploration costs: many users may be assigned to a bad pedagogical policy before the experiment is over, leading to deleterious effects on their learning experience. An alternative to traditional randomized experiments is the contextual bandit, a popular technique from the reinforcement learning (RL) literature (Li et al., 2010). Compared to traditional A/B tests, bandit algorithms can often learn personalized strategies with substantially less experimentation, leading to improved user experiences.

Our paper builds upon the aforementioned intelligent tutoring systems, moving from adaptive platforms to an actual conversational interface that closely mimics some key facets of conversation with a human tutor. Specifically, we designed and evaluated a prototype chatbot system, which we call MathBot. To achieve conversational flow and mirror the experience of interacting with a human tutor, we paid close attention to the timing of prompts and also incorporated informal language and emoji. As with a human tutor, the MathBot system alternates between presenting material and gauging comprehension. MathBot also provides learners with personalized feedback and guidance in the form of explanations, hints, and clarifying sub-problems. Finally, we built into MathBot the capability of learning personalized pedagogical policies via both contextual bandits and randomized experiments, allowing us to compare the two strategies in a live deployment.

To evaluate MathBot, we carried out three user studies on Amazon Mechanical Turk. The first study sought to determine whether users preferred to use MathBot over comparable online resources and, through qualitative feedback from users, elucidate potential avenues for improving MathBot through personalization. At a high level, we found that users were polarized in their preferences, with about half preferring MathBot. Specifically, 116 participants completed (in a randomized order) both an abridged lesson about arithmetic sequences with MathBot and a video on Khan Academy covering similar content; these participants then rated their experiences. We found that 42% of users preferred learning with MathBot over the video, with 20% indicating a strong preference. An additional 110 participants completed the same abridged lesson with MathBot along with a written tutorial from Khan Academy containing embedded practice problems. In this case, 47% of these users preferred learning with MathBot over the written tutorial, with 18% stating a strong preference. While MathBot was not preferred by the majority of our participants, our results point to potential demand for conversational agents among a substantial fraction of learners.

The second study sought to determine whether MathBot produced learning gains on par with comparable online resources. We randomized 369 participants to either complete a full-length conversation with MathBot about arithmetic sequences or complete a set of videos and written tutorials from Khan Academy covering similar content. To test their knowledge, each subject took an identical quiz before and after completing their assigned learning module. Under both conditions, participants exhibited comparable average learning gains and learning times: 65% improvement for MathBot, with a mean learning time of 28 min (SD = 20), and 60% improvement given Khan Academy material, with a mean learning time of 29 min (SD = 22); we note that the difference in learning gain was not statistically significant.

Given that a subset of users indeed preferred MathBot to conventional learning tools, we explored the potential of contextual bandits to learn personalized pedagogical policies for MathBot in the third and main study. For this experiment, we recruited 405 participants to complete a full-length conversation with MathBot about arithmetic sequences. Unlike the first two studies, in which the possible conversation paths were the same for each user, the third study leveraged a version of MathBot that could choose, for each user, whether or not to present certain conceptual lessons and also whether or not to provide certain supplemental practice questions. We randomized participants between two experimentation strategies (one in which actions were chosen by a contextual bandit and another in which actions were randomly chosen, i.e., an A/B design), with the ultimate goal of reducing learning time without reducing learning gains. This goal was motivated by feedback from users in the first two studies, some of whom commented that the pacing of the lesson was too slow and others that it was too fast, suggesting that personalizing the speed of the lesson could be beneficial. We found that, during experimentation, users assigned to the contextual bandit condition took less time (mean difference of 149 seconds, with a 95% confidence interval of [32, 266]) to complete the lesson and were less likely to drop out, while scoring comparably on the post-learning assessment to those assigned to the A/B design condition. Finally, we compared the quality of the learned post-experimentation policies using offline policy evaluation techniques, finding no statistically significant difference between the quality of the policies learned by the contextual bandit and the randomized experiment.

In summary, our contributions are threefold: (1) MathBot, a prototype system that adds conversational interaction to learning mathematics through solving problems and receiving explanations; (2) a live deployment of a contextual bandit in a conversational educational system; (3) evidence that a contextual bandit can continuously personalize an educational conversational agent at a lower cost than a traditional A/B design.

2 Related work

We briefly review past work on building chatbots, conversational tutoring systems, example-tracing tutors, and other intelligent tutoring systems (ITSs). We also survey the use of reinforcement learning algorithms in these systems.

2.1 Chatbots

Chatbots have been widely applied to various domains, such as customer service (Xu et al., 2017), college management (Bala et al., 2017), and purchase recommendation (Horzyk et al., 2009). One approach to building a chatbot is to construct rule-based input-to-output mappings (Al-Rfou et al., 2016; Yan et al., 2016). One can also embed chatbot dialogue into a higher-level structure (Bobrow & Winograd, 1977) to keep track of the current state of the conversation, move fluidly between topics, and collect context for later use (Walker & Whittaker, 1990; Seneff, 1992; Chu-Carroll & Brown, 1997). We envisioned MathBot as having an explicit, predefined goal of the conversation along with clear guidance and control of intermediate steps, so we took the approach of modeling the conversation as a finite-state machine (Raux & Eskenazi, 2009; Quarteroni & Manandhar, 2007; Andrews et al., 2006), where user responses update the conversation state according to a preset transition graph.

2.2 Conversational tutors in education

Conversational tutors in education often build complex dialogues. For example, one might ask students to write qualitative explanations of concepts (e.g., "A battery is connected to a bulb by two wires. The bulb lights. Why?") and initiate discussions based on the responses (Graesser et al., 2001). AutoTutor and its derivatives (Nye et al., 2014; VanLehn et al., 2002; Graesser et al., 1999, 2004) arose from Graesser et al.'s (1995) investigation of human tutoring behaviors and modeled the common approach of helping students improve their answers by way of a conversation. These systems rely on natural language processing (NLP) techniques, such as regular expressions, templates, semantic composition (VanLehn et al., 2002), LSA (Graesser et al., 1999; Person, 2003), and other semantic analysis tools (Graesser et al., 2007). Nye et al. (2018) added conversational routines to the online mathematics ITS ALEKS by attaching mini-dialogues to individual problems but left navigation to be done via a website. MathBot aims to have the entire learning experience take place through a text conversation, giving the impression of a single tutor. More broadly, MathBot differs from past work on NLP-based conversational tutors in that it explores the possibility of reproducing part of the conversational experience without handling extensive open-ended dialogue, potentially reducing development time.

2.3 Intelligent tutoring systems and example-tracing tutors

A wide range of intelligent tutoring systems in mathematics use precise models of students’ mathematical knowledge and misunderstandings (Ritter et al., 2007; VanLehn, 1996; Aleven et al., 2009a, b; O’Rourke et al., 2015). To reduce the time and expertise needed to build ITSs, some researchers have proposed example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016). Specifically, example-tracing tutors allow content designers to specify the feedback that should appear after students provide certain answers and then record those action-feedback pairs in a behavior graph (Aleven et al., 2016). Using the Cognitive Tutor Authoring Tools (CTAT), Aleven et al. (2009a, b) built MathTutor, a suite of example-tracing tutors for teaching 6th, 7th, and 8th grade math. Our work draws on insights from example-tracing tutors in that we build a graph which encodes rules that determine how MathBot responds to specific student answers, though our approach differs in that we display these responses in a conversational format.

2.4 Learning pedagogical strategies with bandits

To allow MathBot to personalize elements during live deployment, we incorporate a contextual multi-armed bandit algorithm (Lai & Robbins, 1985; Li et al., 2010), a tool from reinforcement learning for discovering which actions are effective in different situations (contexts). Other reinforcement learning approaches have been applied in education, typically for offline learning. Ruan et al. (2019) increase student performance by combining adaptive question sequencing with an NLP-based conversational tutor for teaching factual knowledge, but use a combination of random selection and a probabilistic model of learners' knowledge of particular items to order questions. Lee et al. (2014) describe a framework to learn personalized pedagogical policies for DragonBox Adaptive, a K–12 math puzzle platform, without the support of an expertly-designed cognitive model. Chi et al. (2011) use another popular technique from RL to learn an effective pedagogical strategy for making micro-decisions, such as eliciting the next step of the problem versus revealing it, in an NLP-based ITS teaching college-level physics. Lan and Baraniuk (2016) describe a contextual bandit framework to assign students to an educational format and optimize performance on an immediate follow-up assessment, but evaluate the performance of the framework offline and do not personalize the actual lessons. A key difference between these studies and our work is that such strategies are rarely used online in a live educational deployment. Only a handful of studies have begun to explore live deployments for sequencing problems (Clement et al., 2015; Segal et al., 2018), and none that we are aware of do so to learn which actions to take in a conversation.

3 MathBot system design and development

MathBot allows users to learn math topics through conversation-style interaction, rather than simply browsing online resources like videos, written lessons, and problems. Below we give an illustrative example of a learner interacting with MathBot, describe MathBot’s front-end of an interactive chat, and outline its back-end of a conversation graph which specifies the rules by which it progresses through concepts and chooses actions to take based on user responses.

3.1 Sample learner interaction with MathBot

Suppose a student, Alice, wants to learn about arithmetic sequences by interacting with MathBot. To start the interaction, MathBot greets Alice and asks her to extend the basic sequence “2, 4, 6, 8 \(\ldots\)”. Alice correctly answers “10”, so MathBot provides positive feedback (e.g., “Good work! ”) and begins a conceptual explanation of recognizing patterns in sequences. MathBot asks Alice if she is ready to complete a question to check her understanding, and Alice responds affirmatively. Alice progresses successfully through a series of additional explanations and questions.

Following an explanation of common differences, Alice is asked a new question: “What’s the common difference of 2, 8, 14, 20, \(\ldots\)?”. Figure 1 displays the conversation rules that underlie Alice’s current question. When asked the new question, Alice confuses the term “common difference” with “greatest common factor”, a topic she recently reviewed, so she answers “2”. MathBot recognizes that Alice has made a mistake and subsequently checks that she knows how to identify terms in a sequence and subtract them, a prerequisite task for finding the common difference (Fig. 1ii). Alice answers correctly, so MathBot begins to ask her a series of additional sub-questions to further clarify the concept of common differences (Fig. 1iii). Alice successfully completes these sub-questions, so MathBot directs her back to the original question. Alice remembers learning that the common difference is the difference between consecutive terms, though she mistakenly subtracts 8 from 2 and answers “I think it’s −6”. Rather than have Alice finish a redundant series of sub-questions, MathBot recognizes that Alice has made a common mistake, subsequently provides specific feedback to address that mistake, and then allows Alice to retry the original question (Fig. 1iv). Alice answers the original question correctly and proceeds to a new question on identifying decreasing arithmetic sequences (Fig. 1v).

Fig. 1 Example section of MathBot's conversation graph. Ellipses (\(\ldots\)) denote excised sections of the full conversation graph. Marked blocks (i)–(v) denote actions taken by a hypothetical user, Alice, in Sect. 3.1

3.2 MathBot’s front-end chat and back-end conversation graph

The front-end of MathBot is a text chat window between MathBot and the student (Fig. 2a, b). Students type replies to MathBot, answering problems or expressing uncertainty with responses like "I'm not sure". Students can freely scroll through the chat history to review explanations or questions.

Drawing inspiration from example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016), we built the MathBot back-end as a conversation graph that specifies a set of if-then rules for how learner input (e.g., "I'm ready" or "The answer is 6") leads to MathBot's next action (e.g., give a new problem or provide feedback). In this rule-based system, the state of the conversation is represented as a finite state machine (FSM). In this FSM, each state is a response provided by MathBot, and user responses route the user along different paths in the conversation graph. For example, the question asked at the top of Fig. 1 is a state, and responses to that question (e.g., "I don't know" or "6") route users to a new state. MathBot uses fuzzy matching and basic string equivalence to parse responses and route users appropriately.
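To make this structure concrete, the sketch below shows one way a conversation-graph node and its routing rules could be represented in Python. The state names, answer patterns, and fuzzy-matching threshold are illustrative assumptions rather than MathBot's actual implementation, and a real system would also extract numeric answers from free-text replies.

from difflib import SequenceMatcher

def fuzzy_match(response: str, target: str, threshold: float = 0.85) -> bool:
    # Approximate string comparison, standing in for MathBot's fuzzy matcher.
    a, b = response.lower().strip(), target.lower().strip()
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

# Each state holds the bot's message and ordered routing rules keyed on user input;
# the first matching rule determines the next state.
CONVERSATION_GRAPH = {
    "ask_common_difference": {
        "message": "What's the common difference of 2, 8, 14, 20, ...?",
        "rules": [
            ("6", "praise_and_continue"),            # correct answer
            ("-6", "address_reversed_subtraction"),  # common mistake: 2 - 8
            ("I don't know", "check_prerequisites"),
        ],
        "default": "check_prerequisites",            # unrecognized answers trigger sub-questions
    },
    # ... remaining states excised ...
}

def next_state(current_state: str, user_response: str) -> str:
    # Route the user along the finite-state machine given their reply.
    node = CONVERSATION_GRAPH[current_state]
    for pattern, destination in node["rules"]:
        if fuzzy_match(user_response, pattern):
            return destination
    return node["default"]

Ordering the rules so that specific misconception patterns are checked before the default mirrors how the conversation graph in Fig. 1 routes common wrong answers to targeted feedback rather than to the generic sub-question sequence.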

Fig. 2 Example snippets of MathBot conversations

4 Evaluating MathBot

We first validate MathBot in two studies comparing it to Khan Academy, a high-quality, free, and widely-used online resource for math tutorials and problems that delivers content in a non-conversational format. In the first study, we investigate user preferences between the two platforms and solicit qualitative feedback on what users liked and disliked about MathBot. In the second study, we compare the learning efficacy of the two platforms. In the third and main study, we leverage qualitative feedback from the first two studies to design personalized improvements to MathBot’s pedagogical policy.

4.1 Design of Study 1

In the first part of this within-subject study, we ask participants on Amazon Mechanical Turk to interact with MathBot and watch a 6-min Khan Academy video, and then solicit feedback on the two learning methods (Fig. 3). Despite their lack of interactivity, Khan Academy videos are competitive baselines, as they are carefully tailored by expert instructors and are demonstrably effective for teaching mathematical content (Weeraratne & Chin, 2018).

Fig. 3 Study design of the first part of Study 1, which measured preferences for video-based instruction versus instruction via MathBot

We conduct the second part of the study identically, except we recruit new users and replace the video with a written tutorial from Khan Academy containing embedded practice problems (Fig. 4). This second comparison provides an additional layer of insight, as one might conjecture that any result favoring MathBot over video instruction may simply be the result of MathBot providing an interface to work through problems.

Fig. 4 Study design of the second part of Study 1, which measured preferences for instruction via written tutorial versus instruction via MathBot

To limit the length of the study, we use an abridged version of our developed MathBot content that covers only explicit formulas for arithmetic sequences, and pair that with either a Khan Academy video or a written tutorial that covers similar material. To avoid ordering effects, including anchoring bias and fatigue, we randomized the order in which participants saw MathBot and the Khan Academy video or written tutorial. Tables 2 and 3 in the "Appendix" summarize user attrition and filtering, which were similar across conditions (see Footnote 1). After accounting for user attrition and the filtering criteria, 116 participants remained in the first part of the study and 111 participants in the second part. Tables 4 and 5 in the "Appendix" summarize the demographics of the filtered set of users. Our analysis is restricted to this filtered set of users.

4.2 Quantitative results

After study participants completed the MathBot and Khan Academy learning modules, we asked them a series of questions to quantify their experiences. In particular, we asked participants to answer the following question on a 7-point scale ranging from “strongly prefer” MathBot to “strongly prefer” the Khan Academy material: “If you had 60 min to learn more about arithmetic sequences and then take a quiz for a large bonus payment, which of these two scenarios would you prefer? 1. Interact with an expanded version of the conversational computer program, then take the quiz. 2. [Watch more videos / Complete more interactive tutorials] about arithmetic sequences, then take the quiz.” We note that the ordering of options 1 and 2 was randomized for each user.

The responses to this question for the first part of the study are presented in Fig. 5a. We found that 42% of participants stated at least a weak preference for MathBot, 53% stated at least a weak preference for Khan Academy videos, and 5% indicated a neutral preference. The corresponding results for the second part of the study are displayed in Fig. 5b. In that case, we found that 47% of the 110 participants who answered the question stated at least a weak preference for MathBot, 44% stated at least a weak preference for Khan Academy interactive tutorials, and 9% stated a neutral preference. Tables 6, 7, 8, and 9 in the "Appendix" summarize the experiential ratings and time-on-task of participants in Study 1.

Overall, more of our participants preferred Khan Academy materials to MathBot—a testament to the quality of Khan Academy. The highly polarized response distribution, however, also illustrates the promise of new forms of instruction to address heterogeneous learning preferences. Indeed, 20% of users in the first part of the study and 18% of users in the second part expressed a “strong preference” for MathBot over Khan Academy material.

Fig. 5 Distributions of user preferences among the participants of Study 1. "M" denotes MathBot, "V" denotes video, and "T" denotes tutorial. Each "+" indicates a stronger preference, and "\(\sim\)" indicates a neutral choice. Preferences for MathBot and Khan Academy are highly polarized, suggesting that the needs of learners could be better met by offering both modes of instruction

4.3 Qualitative results

After each part of the study, we asked users to respond to the following prompt: “Please compare your experience with the conversational computer program and the [video / interactive tutorial]. In what scenarios could one learning method be more effective or less effective than the other?” We analyzed the resulting comments to identify themes and understand users’ perspectives on MathBot and the Khan Academy videos and written tutorials. One author conducted open coding to identify common themes addressed by each response. Another author verified the coded labels and resolved conflicts with discussion. We discuss the coded categories at length in the Appendix, but highlight one theme in particular, that of pacing, here. We found that different users expressed different sentiments about the pacing of the lessons. For example, one participant noted, “as it gets more complicated, the lesson should slow down a bit,” while another indicated, “I felt like the teaching went too slow for me.” We return to this feedback later on, seeking to address it via personalization, slowing down or speeding up the conversation for each learner as appropriate.

4.4 Design of Study 2

We next sought to evaluate whether MathBot produced comparable learning outcomes to Khan Academy material. To assess educational gains, we randomly assigned participants to learn about arithmetic sequences via: (1) a full-length MathBot conversation; or (2) a combination of Khan Academy videos and written tutorials covering the same content as the MathBot conversation (Fig. 6). We assessed learning outcomes with a 12-question quiz, giving the same quiz before and after each participant completed the learning module (see Footnote 2). Similar filtering criteria to Study 1 resulted in our analyzing 182 subjects assigned to MathBot and 187 assigned to Khan Academy materials. Table 10 in the "Appendix" summarizes user attrition and filtering in Study 2, and Table 11 summarizes user demographics.

Fig. 6 Experimental design of Study 2, which measured learning gains achieved by instruction via MathBot versus instruction via Khan Academy videos and written tutorials

4.5 Results

We start by computing the proportional learning gain (PLG) for each subject. To calculate PLG, we first determine the raw learning gain by subtracting the pre-learning quiz score from the post-learning quiz score. We divide this result by the maximum possible score increase, defined as the difference between the maximum possible post-learning score (12) and the user's pre-learning score. Figure 7 shows the distribution of the PLG. We find the average PLG for MathBot users is 65%, with a 95% confidence interval of [58%, 72%]; the corresponding average PLG for Khan Academy users is 60%, with a 95% confidence interval of [53%, 67%]. The gains from MathBot are slightly higher than those from Khan Academy, but the difference is not statistically significant (two-sample t-test, p = 0.15, 95% CI: [-2%, 12%]). MathBot and Khan Academy users spent comparable time completing the learning modules: 28 min on average for MathBot (SD = 20) and 29 min for the Khan Academy videos and written tutorials (SD = 22). Table 12 in the "Appendix" summarizes raw learning outcomes of participants in Study 2, and Table 13 in the "Appendix" summarizes performance on individual questions in the pre- and post-learning assessments.
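Written as a formula, with pre and post denoting a user's pre- and post-learning quiz scores out of a maximum of 12, the quantity computed above is:

$$\begin{aligned} \text{PLG} = \frac{\text{post} - \text{pre}}{12 - \text{pre}}. \end{aligned}$$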

Fig. 7 Distributions of proportional learning gain (PLG) for users of MathBot and Khan Academy in Study 2. The distributions are similar for users in both conditions

5 Learning a pedagogical policy

Here we return to feedback from users in Study 1 who expressed mixed sentiments about the pacing of MathBot and address their concerns by learning a personalized pedagogical policy for pacing. Given that the MathBot conversation is structured as a series of lessons, each consisting of a conceptual explanation followed by an assessment question, we could potentially adjust the pacing of a lesson in one of four ways: (1) show the conceptual explanation and show an isomorphic practice question before the assessment question (slowest); (2) show the conceptual explanation but skip the isomorphic practice question; (3) skip the conceptual explanation but show the isomorphic practice question; and (4) skip the conceptual explanation and skip the isomorphic practice question (fastest). Figure 8 illustrates these four actions.

We took a data-driven approach to learning a personalized pedagogical strategy that selects between these four actions for each user and question. We specifically chose to use a contextual bandit, a tool from the reinforcement learning literature which balances exploring actions whose payoffs are unclear with exploiting actions whose payoffs are believed to be high (Li et al., 2010). For each user and question, the bandit selects one of the four above actions based on the user's pre-learning quiz score (the context). For example, the algorithm might learn to speed up the conversation for users with high pre-learning quiz scores and slow it down for those with low scores. We note that we had access to many more contextual features than the pre-learning quiz score, such as scores on individual quiz items and self-reported academic history of study participants. However, to best mimic a real-life learning scenario where a tutor has access to only a coarse measure of prior knowledge, such as a grade in a prior course, we chose to use the pre-learning quiz score as the sole context.

Fig. 8 Potential actions taken by the contextual bandit before each assessment question. The bandit chooses whether or not to show a conceptual explanation and whether or not to show an isomorphic practice question

To train a contextual bandit, we must specify not only the actions but also the objective function (the reward) that the algorithm will optimize (see Footnote 3). Recall that our motivation for improving our pedagogical strategy was to personalize the pacing of the lesson, with the goal of either slowing down the chat to boost comprehension or speeding it up without sacrificing learning. These dual desiderata suggest defining our reward as a linear combination of the total time spent on a lesson and an indicator of whether the user gets the assessment question correct on their first try:

$$\begin{aligned} 150 \cdot \mathbf{1}_{\text{correct}} - \text{seconds spent on lesson}. \end{aligned}$$

In other words, we assume it is worth 150 seconds of extra time spent on a lesson to turn a student who would have answered the assessment question incorrectly into a student who answered the question correctly. In particular, we expected the lesson to take around 30 min for 12 concepts, giving 2.5 min (or 150 seconds) per concept. It bears emphasis that the precise form of the reward function should be set by domain experts and depends on the situation. For example, in a setting where a chatbot was augmented by a human tutor, we might increase the relative worth of time compared to correctness to account for the opportunity cost of having the concept explained by the tutor. Finally, we note that our reward is defined at the level of an individual lesson: later, we consider whether the contextual bandit's strategy is also optimizing a global reward defined at the level of the entire learning session.
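To make the setup concrete, the following is a minimal sketch of how such a bandit could operate, using Thompson sampling with an independent Bayesian linear reward model for each (question, action) pair and the pre-learning quiz score as the only feature. The feature set, priors, and noise variance are illustrative assumptions, and the algorithm actually deployed in MathBot may differ.

import numpy as np

ACTIONS = ["concept+isomorph", "concept_only", "isomorph_only", "neither"]

def local_reward(correct_first_try: bool, seconds_on_lesson: float) -> float:
    # Per-lesson reward: 150 points for a first-try correct answer, minus time spent.
    return 150.0 * float(correct_first_try) - seconds_on_lesson

class PerQuestionBandit:
    # Thompson sampling with a Bayesian linear model of reward for every
    # (question, action) pair; the context vector is [1, pre-learning quiz score].
    def __init__(self, n_questions: int, prior_var: float = 100.0, noise_var: float = 2500.0):
        d = 2
        self.A = {(q, a): np.eye(d) / prior_var  # posterior precision matrices
                  for q in range(n_questions) for a in ACTIONS}
        self.b = {(q, a): np.zeros(d)            # accumulated X^T y / noise_var
                  for q in range(n_questions) for a in ACTIONS}
        self.noise_var = noise_var

    def choose(self, question: int, pre_quiz_score: float) -> str:
        x = np.array([1.0, pre_quiz_score])
        best_action, best_draw = ACTIONS[0], -np.inf
        for a in ACTIONS:
            cov = np.linalg.inv(self.A[(question, a)])
            mean = cov @ self.b[(question, a)]
            draw = np.random.multivariate_normal(mean, cov) @ x  # posterior sample of reward
            if draw > best_draw:
                best_action, best_draw = a, draw
        return best_action

    def update(self, question: int, action: str, pre_quiz_score: float, reward: float):
        x = np.array([1.0, pre_quiz_score])
        self.A[(question, action)] += np.outer(x, x) / self.noise_var
        self.b[(question, action)] += reward * x / self.noise_var

Under this sketch, choose would be called before each lesson to pick among the four actions in Fig. 8, and update would be called once the learner's first-try correctness and time on the lesson are observed.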

5.1 Design of Study 3

Our goal is to assess the value of using a contextual bandit to learn a personalized pedagogical strategy for students. We benchmark the bandit against a common alternative: a regression fit on data from users who were randomly assigned to one of the four possible actions before each assessment question. That is, in the benchmark approach, we first conduct an exploration phase, in which we assign users to the four actions uniformly at random; then, we fit a regression on the collected data to learn a personalized policy. The bandit, in contrast, aims to better manage exploration by down-weighting actions that are learned to be ineffective.
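As a concrete illustration of this benchmark, one could regress observed per-lesson rewards on question-by-action indicators plus the pre-learning quiz score using the uniformly collected data, and then define the learned policy as the predicted-reward-maximizing action for each question and score. The sketch below assumes hypothetical column names and a simple least-squares specification; the regression models actually considered are described in Sect. 5.2.2.

import numpy as np
import pandas as pd

ACTIONS = ["concept+isomorph", "concept_only", "isomorph_only", "neither"]

def fit_benchmark_policy(logs: pd.DataFrame):
    # Fit reward ~ (question x action indicators) + pre_quiz_score on uniform random data.
    # `logs` is assumed to have hypothetical columns:
    # question (int), action (str), pre_quiz_score (float), reward (float).
    dummies = pd.get_dummies(logs["question"].astype(str) + ":" + logs["action"])
    design = np.hstack([dummies.to_numpy(dtype=float),
                        logs[["pre_quiz_score"]].to_numpy(dtype=float)])
    coefs, *_ = np.linalg.lstsq(design, logs["reward"].to_numpy(dtype=float), rcond=None)
    columns = list(dummies.columns)

    def policy(question: int, pre_quiz_score: float) -> str:
        # Predict the reward of each action for this (question, score) and pick the best.
        preds = []
        for a in ACTIONS:
            x = np.zeros(len(columns) + 1)
            key = f"{question}:{a}"
            if key in columns:
                x[columns.index(key)] = 1.0
            x[-1] = pre_quiz_score
            preds.append(float(x @ coefs))
        return ACTIONS[int(np.argmax(preds))]

    return policy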

To carry out this comparison, we first recruited 30 participants from Amazon Mechanical Turk and assigned them to each of the four actions at random, independently for each question. Data from this pilot phase were used to provide the bandit a warm start. We then randomly assigned the remaining participants to either: (1) the contextual bandit condition; or (2) the uniform random condition (Fig. 9).

Fig. 9 Experimental design of Study 3, which investigated whether a contextual bandit could learn a personalized pedagogical policy for MathBot at a lower cost than a randomized A/B design

We use the same criteria as in Study 2 to filter participants before they interact with MathBot. These filtering criteria resulted in 228 subjects assigned to MathBot with a uniform random policy and 239 assigned to MathBot with a contextual bandit policy. We note that both groups include the 30 participants from the pilot phase: they are included in the uniform random group since their actions were given uniformly at random, and they are included in the bandit group as the bandit learned its initial policy from those individuals. Table 14 in the “Appendix” summarizes user dropout during experimentation, and Table 15 summarizes learning outcomes.

5.2 Results

We examine the behavior of the contextual bandit algorithm along three dimensions: (1) its degree of personalization; (2) the quality of the final learned pedagogical policy; and (3) the cost of exploration. We found that the bandit learned a personalized policy comparable in quality to the one learned on the uniform random data but, importantly, did so while imposing less burden on users.

5.2.1 Personalization

We begin by examining the pedagogical policy ultimately learned by the contextual bandit (i.e., the policy the bandit believed to be the best at the end of the experiment, after seeing 239 participants). Averaged over all questions, the final, learned policy assigns approximately 30% of users to each of the concept-only, isomorph-only, and no-concept-no-isomorph conditions; the remaining 10% are assigned to the concept-plus-isomorph condition (see Footnote 4). In Fig. 10, we disaggregate the action distribution by question, showing the result for 4 representative questions out of the 11 total. The plot shows that the bandit is indeed learning a policy that differs substantially across users and questions. For only 3 of the 11 questions does the bandit determine that it is best to use the same action for every user; even in these cases, the selected action differs across those 3 questions.

Fig. 10 For four representative assessment questions out of the eleven total, the proportion of users for which the final policy learned by the bandit would use each action. The policy chooses different actions based on each user's pre-learning quiz score

5.2.2 Quality of learned solution

Next, we compare the expected reward of the learned policy from the bandit to that of the learned policy from the uniform random condition (see Footnote 5). For the uniform random condition, we consider three different regression models: (1) a model with two-way interactions between actions and questions (effectively learning a constant policy per question); (2) a model with the same specification as the bandit, which is able to personalize based on pre-learning quiz score; and (3) a lasso regression that includes eight contextual covariates: pre-learning quiz score, accuracy on the previous question, time since starting the learning session, whether in the previous concept they were shown the conceptual explanation and/or isomorphic practice question, the time they spent on the previous concept, and the speed at which they set MathBot to send responses. Of these three models, the first performs the best.

We find that the bandit learned a policy that is comparable to the most successful policy learned from the uniform random condition, and further, that both the bandit and uniform random strategies learned a policy that outperformed the original policy from Studies 1 and 2 of always showing the concept without an isomorphic practice question. In Table 1, we display the average expected rewards of the two learned policies, along with the four policies which use the same action constantly. In particular, we find no statistically significant difference between the average reward obtained by the final bandit policy and the policy learned from the uniform random data. A 95% confidence interval for this difference in average rewards is [−11, 28], slightly in favor of the policy learned in the uniform random condition. Finally, we note one additional advantage of the contextual bandit: it can continually refine its learned solution as more users arrive, whereas the pedagogical policy would traditionally be fixed once the uniform random experiment concludes.

Table 1 For Study 3, average expected reward per question with 95% confidence intervals for the final policy learned from the bandit and uniform random conditions
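As an illustration of how the value of a learned policy can be estimated without deploying it, the sketch below applies inverse propensity scoring to data logged under the uniform random condition, where each of the four actions was assigned with probability 1/4. This is one standard off-policy evaluation approach shown here for illustration only; it is not necessarily the estimator used to produce Table 1, and the data-frame columns are hypothetical.

import numpy as np
import pandas as pd

def ips_value(policy, logs: pd.DataFrame, logging_prob: float = 0.25) -> float:
    # Inverse-propensity-scoring estimate of a policy's average per-lesson reward,
    # computed from data logged under uniform random action assignment.
    # `logs` is assumed to have hypothetical columns: question, pre_quiz_score, action, reward.
    agrees = np.array([policy(q, s) == a for q, s, a in
                       zip(logs["question"], logs["pre_quiz_score"], logs["action"])])
    weights = agrees / logging_prob  # weight of 1/0.25 when the policy matches the logged action, else 0
    return float(np.mean(weights * logs["reward"].to_numpy(dtype=float)))

# Hypothetical usage, assuming `bandit_policy` and `benchmark_policy` map
# (question, pre_quiz_score) to one of the four actions:
# print(ips_value(bandit_policy, uniform_logs), ips_value(benchmark_policy, uniform_logs))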

5.2.3 The cost of exploration

The above results indicate that a contextual bandit can indeed learn a personalized pedagogical policy that is on par with one learned from uniform random data. The primary value of a bandit, however, is that it incurs lower costs of exploration by quickly learning which actions are unlikely to be beneficial. We thus now directly compare the average rewards obtained under the bandit and uniform random conditions during the model-learning period. Higher average reward during model-learning suggests users are having a better experience, as they receive sub-optimal actions less often.

We first compute the average reward for each lesson in both conditions, and then average that quantity over all the lessons for both conditions. This gives us the average reward per lesson per user for both conditions. As shown in Fig. 11 (left panel), the average reward in the contextual bandit condition is substantially higher than in the uniform random condition. A 95% confidence interval on the difference is [9.6, 29.1] (see Footnote 6). In Fig. 12 in the "Appendix" we plot the cumulative average over the total local rewards per user as a function of the number of users in each condition, finding that the bandit quickly improves upon the uniform random policy in terms of average reward.
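Restating this two-stage average as a formula, let \(r_{i\ell}\) be the reward of user \(i\) on lesson \(\ell\), \(n_\ell\) the number of users who completed lesson \(\ell\), and \(L\) the number of lessons; the quantity plotted for each condition is then:

$$\begin{aligned} \bar{R} = \frac{1}{L} \sum_{\ell=1}^{L} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} r_{i\ell}. \end{aligned}$$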

Fig. 11 Average local (left) and global (right) rewards during the experiment for the bandit and uniform random conditions with 2 standard errors (1 SE solid, 2 SE dashed). In both cases, the contextual bandit obtains higher rewards, suggesting that it provides improved user experience while learning an optimal pedagogical policy

As another way to assess the cost of exploration, we compute the average value of a global reward function across users in our two conditions—bandit and uniform random. Analogous to the local reward function, the global reward is defined as:

$$\begin{aligned} 150 \cdot \text{post-learning quiz score} - \text{seconds spent on MathBot}. \end{aligned}$$

In contrast to the local reward function, the global reward considers the total post-learning quiz score and total time spent on the entire MathBot conversation, rather than correctness and time spent during individual lessons.

Figure 11 (right panel) shows the average global rewards of participants between the two conditions. We find that the bandit obtains considerably higher average global rewards than the uniform random condition, with the difference being 171 (95% CI [18, 324], p = 0.029). We note further that the difference is mostly driven by users in the bandit condition taking far less time to finish the MathBot conversation. Table 15 breaks down the average learning gains and lesson times for users in the two conditions. The average difference in lesson times is 149 seconds (95% CI [32, 266], p = 0.012), translating to the bandit saving around 12% of the uniform policy's lesson time, while users in both conditions scored roughly the same on the post-learning quiz, with the difference being 0.2 questions (95% CI [−0.53, 0.93], p = 0.59), slightly in favor of the bandit. The bandit was only designed to optimize for local rewards, so this result offers further evidence that the bandit is learning a generally effective policy.

As a final way to assess user satisfaction during exploration, we examine the difference in dropout rates between the two conditions. A user is said to “drop out” if they complete the first MathBot lesson but not the final lesson, either skipping to the post-learning quiz or leaving the experiment. Out of the participants in the bandit condition, 9% dropped out, compared to 15% of participants in the uniform random condition—a statistically significant gap of 6 percentage points (two-proportion z-test, \(p<0.05\)). This result again suggests the bandit provides an improved user experience while learning a pedagogical policy.

6 Discussion

6.1 Limitations

One potential shortcoming of MathBot and similar conversational tutoring systems is the time needed to develop and test the underlying conversation graph. On the other hand, since MathBot does not require researchers to develop NLP algorithms and models for conversation, it has one of the strengths of example-tracing tutors: those without extensive machine-learning expertise, including high-school instructors, could feasibly participate in development. In addition, Study 3 demonstrated the successful use of contextual bandit algorithms to learn how to personalize elements of the conversation graph—specifically, when to skip conceptual explanations and when to give additional practice problems. This result provides one demonstration of how such components could be learned via a data-driven process after deployment, further minimizing the development time.

It is worth discussing whether our success in using a contextual bandit to learn a pedagogical policy might generalize to other learning scenarios. One of the main theoretical concerns of using a contextual bandit in learning scenarios is that it may not be able to optimally handle long-term dependencies (e.g., skipping the first conceptual explanation hurts performance on the eighth concept). Much work has thus explored more complex methods for learning personalized pedagogical strategies, which in turn require more data (Chi et al., 2011; Ruan et al., 2019). In particular, we point out two features of our setting that are encouraging in this respect: (1) the lesson contains many concepts, most of which build upon one another, and (2) our bandit, despite being designed with a local reward function, was still able to learn more effectively than a uniform random policy even when evaluated with a global reward function. These two points of evidence suggest that bandits, despite theoretical concerns, may still have value in learning pedagogical policies even in complex and path-dependent learning scenarios.

An important limitation of our study is that we evaluated MathBot using a convenience sample of adults from Amazon Mechanical Turk. While Mechanical Turk workers have been shown to exhibit similar video-watching behavior and quiz performance as MOOC learners (Davis et al., 2018), it would be valuable to test our system with a population actively exposed to algebra instruction, such as high school students or remedial adult learners in college. Our study also does not address the implications of using MathBot as a major component of a full-length course. For example, we did not investigate knowledge retention, and we do not know whether students would enjoy using MathBot less or more if they used it to learn over the course of several weeks or months. One potential upside of using MathBot over a longer period of time is that if a student's aptitude changed, MathBot would automatically adjust its lessons accordingly. Since MathBot's applicability to a classroom setting is yet to be explored, future work could consider how this approach would be received and used by teachers. For example, would MathBot be most useful as homework, as an optional supplementary resource, or as in-class practice?

Additionally, our system taught a single algebra topic, arithmetic sequences, with a conversation intended to last approximately 30 min (Studies 2 and 3) or under 10 min (Study 1). Furthermore, because Khan Academy is an independent platform, we were unable to closely examine the video-watching and tutorial-completion behavior of participants in Studies 1 and 2. Further investigation is necessary to understand exactly which of our insights might generalize to other learning scenarios, including longer interaction periods, different topics in mathematics, and different learning formats such as games (Lee et al., 2014).

6.2 Conclusion

In this work, we developed and studied the effect of an interactive math tutoring system: MathBot. Although the content of MathBot closely matched that of the Khan Academy materials, we found evidence of heterogeneous learning preferences. MathBot produced learning gains that were somewhat higher than those of Khan Academy, though the gap was not statistically significant. Finally, we found that a contextual bandit was able to efficiently learn a personalized pedagogical policy for showing extra practice problems and skipping explanations to appropriately alter the pace of the MathBot conversation, outperforming a randomized experiment.

We note several directions for further work. We found that the bandit was able to learn as effective a policy as the randomized A/B experiment while requiring less time; however, we did not study what might happen in a setting where the time of the lesson was fixed and the bandit instead had to optimize learning gains given the fixed time allotted for the lesson. Additionally, given the challenge of fully exploring a substantial number of actions and a sizable context space with only a limited number of interactions with real users, the contextual bandit used in Study 3 had access to only four actions and one contextual variable. If a future iteration of MathBot were released to a larger audience, the bandit could explore additional actions, such as entirely skipping topics or providing more than one additional practice question, and could leverage additional contextual variables, such as users' stated preferences for learning via conceptual explanations versus example problems, or individual pre-quiz answers. Furthermore, the choice of learning media itself could be personalized with either a contextual bandit or another technique from the reinforcement learning literature: one could certainly imagine specific students or concepts being better suited for conversation than video or vice-versa.

Several users in Study 1 noted the benefit of interacting with multiple learning modules, and past work has demonstrated that prompting users with relevant questions periodically during a video may improve learning outcomes (Shin et al., 2018). Accordingly, one could explore integrating brief conversations with MathBot into educational videos or, conversely, incorporating video elements into the MathBot conversation. Though MathBot interactively guides learners through explanations and relevant questions, it does not provide a platform for extensive rote practice after finishing the conversation. An adaptive question sequencing model such as DASH [29] could be used to guide students through an optimized sequence of practice problems by accounting for student performance during the MathBot conversation. We hope that future work will investigate the potential of intelligent tutoring systems that incorporate multiple modes of teaching and learn to personalize themselves to individual student needs.