Abstract
To emulate the interactivity of in-person math instruction, we developed MathBot, a rule-based chatbot that explains math concepts, provides practice questions, and offers tailored feedback. We evaluated MathBot through three Amazon Mechanical Turk studies in which participants learned about arithmetic sequences. In the first study, we found that more than 40% of our participants indicated a preference for learning with MathBot over videos and written tutorials from Khan Academy. The second study measured learning gains, and found that MathBot produced comparable gains to Khan Academy videos and tutorials. We solicited feedback from users in those two studies to emulate a real-world development cycle, with some users finding the lesson too slow and others finding it too fast. We addressed these concerns in the third and main study by integrating a contextual bandit algorithm into MathBot to personalize the pace of the conversation, allowing the bandit to either insert extra practice problems or skip explanations. We randomized participants between two conditions in which actions were chosen uniformly at random (i.e., a randomized A/B experiment) or by the contextual bandit. We found that the bandit learned a similarly effective pedagogical policy to that learned by the randomized A/B experiment while incurring a lower cost of experimentation. Our findings suggest that personalized conversational agents are promising tools to complement existing online resources for math education, and that data-driven approaches such as contextual bandits are valuable tools for learning effective personalization.
1 Introduction
Math learners can now turn to a wide variety of freely available online resources, from Khan Academy to Massive Open Online Courses (MOOCs). However, many of these resources cannot fully reproduce key features of in-person tutoring: the sense of a back-and-forth exchange with a tutor, tailored feedback, and guidance on how to allocate attention between reading explanations and practicing problems. Existing online math platforms have recently moved towards these desiderata with features like personalized feedback and guidance. For example, online math homework tools like ASSISTments (Heffernan & Heffernan, 2014; Feng et al., 2009) give feedback on common wrong answers. Further, online resources like MathTutor (Aleven et al., 2009a) build on example-tracing tutors (Aleven et al., 2016), which model the progression of a lesson with a behavior graph that: (1) outlines potential student actions, such as providing common incorrect responses; and (2) specifies the feedback, explanation, or new problem that should follow those actions. That approach aims to reduce development time while achieving some of the benefits of intelligent tutoring systems for mathematics, like personalized selection of problems (Nye et al., 2018; Falmagne et al., 2013; Craig et al., 2013; Winkler et al., 2020).
One consequence of online math education shifting from static media to adaptive intelligent tutoring systems is the dramatic increase in potential for personalization of the platform. When developing interactive platforms, a content designer must choose an appropriate pedagogical strategy: for example, whether the topic should be conveyed through conceptual lessons or practice problems, and the degree to which feedback should be provided. To choose an optimal pedagogical strategy for every new piece of content, one could turn to cognitive and educational experts and draw from educational theory, such as aptitude treatment interaction (Snow, 1989). In practice, however, it can be difficult to operationalize such theories to create effective strategies (Zhou et al., 2017). Furthermore, the large number of avenues for personalization, some of which may not have been previously investigated in the literature, along with the data available in online platforms suggests a more computational approach for learning personalized pedagogical policies.
The traditional method to compare the efficacy of various policies is to run a randomized A/B experiment. However, running such an experiment may not be feasible or desirable in adaptive education platforms due to high exploration costs: many users may be assigned to a bad pedagogical policy before the experiment is over, leading to deleterious effects on their learning experience. An alternative to traditional randomized experiments is the contextual bandit, a popular technique from the reinforcement learning (RL) literature (Li et al., 2010). Compared to traditional A/B tests, bandit algorithms can often learn personalized strategies with substantially less experimentation, leading to improved user experiences.
Our paper builds upon the aforementioned intelligent tutoring systems, moving from adaptive platforms to an actual conversational interface that closely mimics some key facets of conversation with a human tutor. Specifically, we designed and evaluated a prototype chatbot system, which we call MathBot. To achieve conversational flow and mirror the experience of interacting with a human tutor, we paid close attention to the timing of prompts and also incorporated informal language and emoji. As with a human tutor, the MathBot system alternates between presenting material and gauging comprehension. MathBot also provides learners with personalized feedback and guidance in the form of explanations, hints, and clarifying sub-problems. Finally, we built into MathBot the capability of learning personalized pedagogical policies via both contextual bandits and randomized experiments, allowing us to compare the two strategies in a live deployment.
To evaluate MathBot, we carried out three user studies on Amazon Mechanical Turk. The first study sought to determine whether users preferred to use MathBot over comparable online resources and, through qualitative feedback from users, elucidate potential avenues for improving MathBot through personalization. At a high level, we found that users were polarized in their preferences, with about half preferring MathBot. Specifically, 116 participants completed (in a randomized order) both an abridged lesson about arithmetic sequences with MathBot and a video on Khan Academy covering similar content; these participants then rated their experiences. We found that 42% of users preferred learning with MathBot over the video, with 20% indicating a strong preference. An additional 110 participants completed the same abridged lesson with MathBot along with a written tutorial from Khan Academy containing embedded practice problems. In this case, 47% of these users preferred learning with MathBot over the written tutorial, with 18% stating a strong preference. While MathBot was not preferred by the majority of our participants, our results point to potential demand for conversational agents among a substantial fraction of learners.
The second study sought to determine whether MathBot produced learning gains on par with comparable online resources. We randomized 369 participants to either complete a full-length conversation with MathBot about arithmetic sequences or complete a set of videos and written tutorials from Khan Academy covering similar content. To test their knowledge, each subject took an identical quiz before and after completing their assigned learning module. Under both conditions, participants exhibited comparable average learning gains and learning times: 65% improvement for MathBot, with a mean learning time of 28 min (SD = 20), and 60% improvement given Khan Academy material, with a mean learning time of 29 min (SD = 22); we note that the difference in learning gain was not statistically significant.
Given that a subset of users indeed preferred MathBot to conventional learning tools, we explored the potential of contextual bandits to learn personalized pedagogical policies for MathBot in the third and main study. For this experiment, we recruited 405 participants to complete a full-length conversation with MathBot about arithmetic sequences. Unlike the first two studies, in which the possible conversation paths were the same for each user, the third study leveraged a version of MathBot that could choose, for each user, whether or not to present certain conceptual lessons and also whether or not to provide certain supplemental practice questions. We randomized participants between two experimentation strategies—one in which actions were chosen by a contextual bandit and another in which actions were randomly chosen (i.e., an A/B design)—with the ultimate goal of reducing learning time without reducing learning gains. This goal was motivated by feedback from users in the first two studies, some of whom commented that the pacing of the lesson was too slow and some too fast, suggesting that personalizing the speed of the lesson could be beneficial. We found that, during experimentation, users assigned to the contextual bandit condition took less time (mean difference of 149 seconds, with a 95% confidence interval of [32, 266]) to complete the lesson and were less likely to drop out, despite scoring equivalently on the post-learning assessment as those assigned to the A/B design condition. Finally, we compared the quality of the learned post-experimentation policies using offline policy evaluation techniques, finding no statistically significant difference between the quality of the policies learned by the contextual bandit and the randomized experiment.
In summary, our contributions are threefold: (1) MathBot, a prototype system that adds conversational interaction to learning mathematics through solving problems and receiving explanations; (2) a live deployment of a contextual bandit in a conversational educational system; (3) evidence that a contextual bandit can continuously personalize an educational conversational agent at a lower cost than a traditional A/B design.
2 Related work
We briefly review past work on building chatbots, conversational tutoring systems, example-tracing tutors, and other intelligent tutoring systems (ITSs). We also survey the use of reinforcement learning algorithms in these systems.
2.1 Chatbots
Chatbots have been widely applied to various domains, such as customer service (Xu et al., 2017), college management (Bala et al., 2017), and purchase recommendation (Horzyk et al., 2009). One approach to building a chatbot is to construct rule-based input-to-output mappings (Al-Rfou et al., 2016; Yan et al., 2016). One can also embed chatbot dialogue into a higher-level structure (Bobrow & Winograd, 1977) to keep track of the current state of the conversation, move fluidly between topics, and collect context for later use (Walker & Whittaker, 1990; Seneff, 1992; Chu-Carroll & Brown, 1997). We envisioned MathBot as having an explicit, predefined goal of the conversation along with clear guidance and control of intermediate steps, so we took the approach of modeling the conversation as a finite-state machine (Raux & Eskenazi, 2009; Quarteroni & Manandhar, 2007; Andrews et al., 2006), where user responses update the conversation state according to a preset transition graph.
2.2 Conversational tutors in education
Conversational tutors in education often build complex dialogues. For example, one might ask students to write qualitative explanations of concepts (e.g., A battery is connected to a bulb by two wires. The bulb lights. Why?) and initiate discussions based on the responses (Graesser et al., 2001). AutoTutor and its derivatives (Nye et al., 2014; VanLehn et al., 2002; Graesser et al., 1999, 2004) arose from Graesser et al.'s (1995) investigation of human tutoring behaviors and modeled the common approach of helping students improve their answers by way of a conversation. These systems rely on natural language processing (NLP) techniques, such as regular expressions, templates, semantic composition (VanLehn et al., 2002), LSA (Graesser et al., 1999; Person, 2003), and other semantic analysis tools (Graesser et al., 2007). Nye et al. (2018) added conversational routines to the online mathematics ITS ALEKS by attaching mini-dialogues to individual problems but left navigation to be done via a website. MathBot aims to have the entire learning experience take place through a text conversation, giving the impression of a single tutor. More broadly, MathBot differs from past work on NLP-based conversational tutors in that it explores the possibility of reproducing part of the conversational experience without handling extensive open-ended dialogue, potentially reducing development time.
2.3 Intelligent tutoring systems and example-tracing tutors
A wide range of intelligent tutoring systems in mathematics use precise models of students’ mathematical knowledge and misunderstandings (Ritter et al., 2007; VanLehn, 1996; Aleven et al., 2009a, b; O’Rourke et al., 2015). To reduce the time and expertise needed to build ITSs, some researchers have proposed example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016). Specifically, example-tracing tutors allow content designers to specify the feedback that should appear after students provide certain answers and then record those action-feedback pairs in a behavior graph (Aleven et al., 2016). Using the Cognitive Tutor Authoring Tools (CTAT), Aleven et al. (2009a, b) built MathTutor, a suite of example-tracing tutors for teaching 6th, 7th, and 8th grade math. Our work draws on insights from example-tracing tutors in that we build a graph which encodes rules that determine how MathBot responds to specific student answers, though our approach differs in that we display these responses in a conversational format.
2.4 Learning pedagogical strategies with bandits
To allow MathBot to personalize elements during live deployment, we incorporate a contextual multi-armed bandit algorithm (Lai & Robbins, 1985; Li et al., 2010), a tool from reinforcement learning for discovering which actions are effective in different situations (contexts). Other reinforcement learning approaches have been applied in education, typically for offline learning. Ruan et al. (2019) increase student performance by combining adaptive question sequencing with an NLP-based conversational tutor for teaching factual knowledge, but use a combination of random selection and a probabilistic model of learners’ knowledge of particular items to order questions. Lee et al. (2014) describe a framework to learn personalized pedagogical policies for DragonBox Adaptive, a K–12 math puzzle platform, without the support of an expertly-designed cognitive model. Chi et al. (2011) use another popular technique from RL to learn an effective pedagogical strategy for making micro-decisions, such as eliciting the next step of the problem versus revealing it, in an NLP-based ITS teaching college-level physics. Lan and Baraniuk (2016) describe a contextual bandit framework to assign students to an educational format and optimize performance on an immediate follow-up assessment, but evaluate the performance of the framework offline and do not personalize the actual lessons. A key difference between these studies and our work is that such strategies are rarely used online in a live educational deployment. Only a handful of studies have begun to explore live deployments for sequencing problems (Clement et al., 2015; Segal et al., 2018), and none that we are aware of do so to learn which actions to take in a conversation.
3 MathBot system design and development
MathBot allows users to learn math topics through conversation-style interaction, rather than simply browsing online resources like videos, written lessons, and problems. Below we give an illustrative example of a learner interacting with MathBot, describe MathBot’s front-end of an interactive chat, and outline its back-end of a conversation graph which specifies the rules by which it progresses through concepts and chooses actions to take based on user responses.
3.1 Sample learner interaction with MathBot
Suppose a student, Alice, wants to learn about arithmetic sequences by interacting with MathBot. To start the interaction, MathBot greets Alice and asks her to extend the basic sequence “2, 4, 6, 8, …”. Alice correctly answers “10”, so MathBot provides positive feedback (e.g., “Good work!”) and begins a conceptual explanation of recognizing patterns in sequences. MathBot asks Alice if she is ready to complete a question to check her understanding, and Alice responds affirmatively. Alice progresses successfully through a series of additional explanations and questions.
Following an explanation of common differences, Alice is asked a new question: “What’s the common difference of 2, 8, 14, 20, …?”. Figure 1 displays the conversation rules that underlie Alice’s current question. When asked the new question, Alice confuses the term “common difference” with “greatest common factor”, a topic she recently reviewed, so she answers “2”. MathBot recognizes that Alice has made a mistake and subsequently checks that she knows how to identify terms in a sequence and subtract them, a prerequisite task for finding the common difference (Fig. 1ii). Alice answers correctly, so MathBot begins to ask her a series of additional sub-questions to further clarify the concept of common differences (Fig. 1iii). Alice successfully completes these sub-questions, so MathBot directs her back to the original question. Alice remembers learning that the common difference is the difference between consecutive terms, though she mistakenly subtracts 8 from 2 and answers “I think it’s −6”. Rather than have Alice finish a redundant series of sub-questions, MathBot recognizes that Alice has made a common mistake, subsequently provides specific feedback to address that mistake, and then allows Alice to retry the original question (Fig. 1iv). Alice answers the original question correctly and proceeds to a new question on identifying decreasing arithmetic sequences (Fig. 1v).
3.2 MathBot’s front-end chat and back-end conversation graph
The front-end of MathBot is a text chat window between MathBot and the student (Fig. 2a, b). Students type replies to MathBot, giving answers to problems or free-form responses like “I’m not sure”. Students can freely scroll through the chat history to review explanations or questions.
Drawing inspiration from example-tracing tutors (Koedinger et al., 2004; Aleven et al., 2009b, 2016), the MathBot back-end consists of a conversation graph that specifies a set of if-then rules for how learner input (e.g., “I’m ready” or “The answer is 6”) leads to MathBot’s next action (e.g., give a new problem or provide feedback). In this rule-based system, the conversation is modeled as a finite-state machine (FSM): each state is a response provided by MathBot, and user responses route the user along different paths in the conversation graph. For example, the question asked at the top of Fig. 1 is a state, and responses to that question (e.g., “I don’t know” or “6”) route users to a new state. MathBot uses fuzzy matching and basic string equivalence to parse responses and route users appropriately.
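To make these if-then rules concrete, the sketch below shows one way a conversation-graph state could be encoded. The state names, prompts, and answer rules are illustrative assumptions loosely based on the example in Fig. 1, and Python's standard-library `difflib` stands in for whatever fuzzy matcher MathBot actually uses.

```python
import difflib

# A minimal sketch of a conversation graph, assuming a dict-based encoding.
# Each state holds MathBot's utterance plus if-then rules mapping learner
# replies to the next state. All names and answers here are hypothetical.
CONVERSATION_GRAPH = {
    "ask_common_difference": {
        "prompt": "What's the common difference of 2, 8, 14, 20, ...?",
        "rules": {
            "6": "positive_feedback",      # correct answer
            "-6": "address_sign_mistake",  # common mistake: 2 - 8
            "i don't know": "subquestion_identify_terms",
        },
        "default": "subquestion_identify_terms",
    },
    # ... remaining states elided ...
}

def next_state(state_name: str, user_reply: str) -> str:
    """Route a learner reply: exact string match first, then fuzzy match."""
    state = CONVERSATION_GRAPH[state_name]
    reply = user_reply.strip().lower()
    if reply in state["rules"]:
        return state["rules"][reply]
    close = difflib.get_close_matches(reply, state["rules"], n=1, cutoff=0.8)
    return state["rules"][close[0]] if close else state["default"]
```

In this encoding, adding a new remediation path (such as the sign-mistake feedback in Fig. 1iv) only requires adding a rule entry, which is what makes the example-tracing approach accessible to non-programmers.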
4 Evaluating MathBot
We first validate MathBot in two studies comparing it to Khan Academy, a high-quality, free, and widely-used online resource for math tutorials and problems that delivers content in a non-conversational format. In the first study, we investigate user preferences between the two platforms and solicit qualitative feedback on what users liked and disliked about MathBot. In the second study, we compare the learning efficacy of the two platforms. In the third and main study, we leverage qualitative feedback from the first two studies to design personalized improvements to MathBot’s pedagogical policy.
4.1 Design of Study 1
In the first part of this within-subject study, we ask participants on Amazon Mechanical Turk to interact with MathBot and watch a 6-min Khan Academy video, and then solicit feedback on the two learning methods (Fig. 3). Despite their lack of interactivity, Khan Academy videos are competitive baselines, as they are carefully tailored by expert instructors and are demonstrably effective for teaching mathematical content (Weeraratne & Chin, 2018).
We conduct the second part of the study identically, except we recruit new users and replace the video with a written tutorial from Khan Academy containing embedded practice problems (Fig. 4). This second comparison provides an additional layer of insight, as one might conjecture that any result favoring MathBot over video instruction may simply be the result of MathBot providing an interface to work through problems.
To limit the length of the study, we use an abridged version of our developed MathBot content that covers only explicit formulas for arithmetic sequences, and pair that with either a Khan Academy video or a written tutorial that covers similar material. To avoid ordering effects—including anchoring bias and fatigue—we randomized the order in which participants saw MathBot and the Khan Academy video or written tutorial. Tables 2 and 3 in the "Appendix" summarize user attrition and filtering, which were similar across conditions (Footnote 1). After accounting for user attrition and the filtering criteria, 116 participants remained in the first part of the study and 111 participants in the second part. Tables 4 and 5 in the "Appendix" summarize the demographics of the filtered set of users. Our analysis is restricted to this filtered set of users.
4.2 Quantitative results
After study participants completed the MathBot and Khan Academy learning modules, we asked them a series of questions to quantify their experiences. In particular, we asked participants to answer the following question on a 7-point scale ranging from “strongly prefer” MathBot to “strongly prefer” the Khan Academy material: “If you had 60 min to learn more about arithmetic sequences and then take a quiz for a large bonus payment, which of these two scenarios would you prefer? 1. Interact with an expanded version of the conversational computer program, then take the quiz. 2. [Watch more videos / Complete more interactive tutorials] about arithmetic sequences, then take the quiz.” We note that the ordering of options 1 and 2 was randomized for each user.
The responses to this question for the first part of the study are presented in Fig. 5a. We found that 42% of participants stated at least a weak preference for MathBot, 53% stated at least a weak preference for Khan Academy videos, and 5% indicated a neutral preference. The corresponding results for the second part of the study are displayed in Fig. 5b. In that case, we found that 47% of the 110 participants who answered the question stated at least a weak preference for MathBot, 44% stated at least a weak preference for Khan Academy interactive tutorials, and 9% stated a neutral preference. Tables 6, 7, 8, and 9 in the "Appendix" summarize the experiential ratings and time-on-task of participants in Study 1.
Overall, more of our participants preferred Khan Academy materials to MathBot—a testament to the quality of Khan Academy. The highly polarized response distribution, however, also illustrates the promise of new forms of instruction to address heterogeneous learning preferences. Indeed, 20% of users in the first part of the study and 18% of users in the second part expressed a “strong preference” for MathBot over Khan Academy material.
4.3 Qualitative results
After each part of the study, we asked users to respond to the following prompt: “Please compare your experience with the conversational computer program and the [video / interactive tutorial]. In what scenarios could one learning method be more effective or less effective than the other?” We analyzed the resulting comments to identify themes and understand users’ perspectives on MathBot and the Khan Academy videos and written tutorials. One author conducted open coding to identify common themes addressed by each response. Another author verified the coded labels and resolved conflicts through discussion. We discuss the coded categories at length in the Appendix, but highlight one theme in particular, that of pacing, here. We found that different users expressed different sentiments about the pacing of the lessons. For example, one participant noted, “as it gets more complicated, the lesson should slow down a bit,” while another indicated, “I felt like the teaching went too slow for me.” We return to this feedback later, seeking to address it via personalization, slowing down or speeding up the conversation for each learner as appropriate.
4.4 Design of Study 2
We next sought to evaluate whether MathBot produced comparable learning outcomes to Khan Academy material. To assess educational gains, we randomly assigned participants to learn about arithmetic sequences via: (1) a full-length MathBot conversation; or (2) a combination of Khan Academy videos and written tutorials covering the same content as the MathBot conversation. We assessed learning outcomes with a 12-question quiz, giving the same quiz before and after each participant completed the learning module (Footnote 2). Similar filtering criteria to Study 1 resulted in our analyzing 182 subjects assigned to MathBot and 187 assigned to Khan Academy materials. Table 10 in the "Appendix" summarizes user attrition and filtering in Study 2, and Table 11 summarizes user demographics (Fig. 6).
4.5 Results
We start by computing the proportional learning gain (PLG) for each subject. To calculate PLG, we first determine the raw learning gain by subtracting the pre-learning quiz score from the post-learning quiz score. We divide this result by the maximum possible score increase, defined as the difference between the maximum possible post-learning score (12) and the user’s pre-learning score. Figure 7 shows the distribution of the PLG. We find the average PLG for MathBot users is 65%, with a 95% confidence interval of [58%, 72%]; the corresponding average PLG for Khan Academy users is 60%, with a 95% confidence interval of [53%, 67%]. The gains from MathBot are slightly higher than those from Khan Academy, but the difference is not statistically significant (two-mean t-test, p = 0.15, 95% CI: [-2%, 12%]). MathBot and Khan Academy users spent comparable time completing the learning modules—28 min on average for MathBot (SD = 20) and 29 min for the Khan Academy videos and written tutorials (SD = 22). Table 12 in the "Appendix" summarizes raw learning outcomes of participants in Study 2, and Table 13 in "Appendix" summarizes performance on individual questions in the pre- and post-learning assessments.
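In symbols, for a user with pre-learning score \(s_{\text{pre}}\) and post-learning score \(s_{\text{post}}\) (each out of 12),

$$\mathrm{PLG} = \frac{s_{\text{post}} - s_{\text{pre}}}{12 - s_{\text{pre}}}.$$

For example, a user who improves from 4 to 10 correct answers has a PLG of \((10 - 4)/(12 - 4) = 75\%\).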
5 Learning a pedagogical policy
Here we return to feedback from users in Study 1 who expressed mixed sentiments about the pacing of MathBot and address their concerns by learning a personalized pedagogical policy for pacing. Given that the MathBot conversation is structured as a series of lessons, each consisting of a conceptual explanation followed by an assessment question, we could potentially adjust the pacing of a lesson in one of four ways: (1) show the conceptual explanation and show an isomorphic practice question before the assessment question (slowest); (2) show the conceptual explanation but skip the isomorphic practice question; (3) skip the conceptual explanation but show the isomorphic practice question; and (4) skip the conceptual explanation and skip the isomorphic practice question (fastest). Figure 8 illustrates these four actions.
We took a data-driven approach to learning a personalized pedagogical strategy that selects between these four actions for each user and question. We specifically chose to use a contextual bandit, a tool from the reinforcement learning literature which balances exploring actions whose payoffs are unclear with exploiting actions whose payoffs are believed to be high (Li et al., 2010). For each user and question, the bandit selects one of the four above actions based on the user’s pre-learning quiz score (the context). For example, the algorithm might learn to speed up the conversation for users with high pre-learning quiz scores and slow it down for those with low scores. We note that we had access to many more contextual features than the pre-learning quiz score, such as scores on individual quiz items and self-reported academic history of study participants. However, to best mimic a real-life learning scenario where a tutor has access to only a coarse measure of prior knowledge, such as a grade in a prior course, we choose to use the pre-learning quiz score as the sole context.
To train a contextual bandit, we must specify not only the actions but also the objective function (the reward) over which the algorithm will optimize (Footnote 3). Recall that our motivation for improving our pedagogical strategy was to personalize the pacing of the lesson, with the goal of either slowing down the chat to boost comprehension or speeding it up without sacrificing learning. These dual desiderata suggest defining our reward as a linear combination of the total time spent on a lesson and an indicator of whether the user gets the assessment question correct on their first try:

$$\text{reward} \;=\; 150 \cdot \mathbb{1}\{\text{assessment question correct on first try}\} \;-\; \text{seconds spent on the lesson}.$$
In other words, we assume it is worth 150 seconds of extra time spent on a lesson to turn a student who would have answered the assessment question incorrectly into a student who answered the question correctly. In particular, we expected the lesson to take around 30 min for 12 concepts, or 2.5 min (150 seconds) per concept. It bears emphasis that the precise form of the reward function should be set by domain experts and depends on the situation. For example, in a setting where a chatbot was augmented by a human tutor, we might increase the relative worth of time compared to correctness to account for the opportunity cost of having the concept explained by the tutor. Finally, we note that our reward is defined at the level of an individual lesson: later, we consider whether the contextual bandit’s strategy is also optimizing a global reward defined at the level of the entire learning session.
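As a minimal sketch, assuming time is measured in seconds and the 150-second trade-off described above, the local reward could be computed as:

```python
def local_reward(correct_first_try: bool, lesson_time_seconds: float) -> float:
    """Local reward for one lesson: a first-try correct answer on the
    assessment question is worth 150 seconds of lesson time."""
    return 150.0 * float(correct_first_try) - lesson_time_seconds
```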
5.1 Design of Study 3
Our goal is to assess the value of using a contextual bandit to learn a personalized pedagogical strategy for students. We benchmark the bandit against a common alternative: a regression fit on data from users who were randomly assigned to one of the four possible actions before each assessment question. That is, in the benchmark approach, we first conduct an exploration phase, in which we assign users to the four actions uniformly at random; then, we fit a regression on the collected data to learn a personalized policy. The bandit, in contrast, aims to better manage exploration by down-weighting actions that are learned to be ineffective.
To carry out this comparison, we first recruited 30 participants from Amazon Mechanical Turk and, independently for each assessment question, assigned each participant to one of the four actions uniformly at random. Data from this pilot phase were used to provide the bandit a warm start. We then randomly assigned the remaining participants to either: (1) the contextual bandit condition; or (2) the uniform random condition (Fig. 9).
We use the same criteria as in Study 2 to filter participants before they interact with MathBot. These filtering criteria resulted in 228 subjects assigned to MathBot with a uniform random policy and 239 assigned to MathBot with a contextual bandit policy. We note that both groups include the 30 participants from the pilot phase: they are included in the uniform random group since their actions were given uniformly at random, and they are included in the bandit group as the bandit learned its initial policy from those individuals. Table 14 in the “Appendix” summarizes user dropout during experimentation, and Table 15 summarizes learning outcomes.
5.2 Results
We examine the behavior of the contextual bandit algorithm along three dimensions: (1) its degree of personalization; (2) the quality of the final learned pedagogical policy; and (3) the cost of exploration. We found that the bandit learned a personalized policy comparable in quality to the one learned on the uniform random data but, importantly, did so while imposing less burden on users.
5.2.1 Personalization
We begin by examining the pedagogical policy ultimately learned by the contextual bandit (i.e., the policy the bandit believed to be the best at the end of the experiment, after seeing 239 participants). Averaged over all questions, the final, learned policy assigns approximately 30% of users to each of the concept-only, isomorph-only, and no-concept-no-isomorph conditions; the remaining 10% are assigned to the concept-plus-isomorph condition (Footnote 4). In Fig. 10, we disaggregate the action distribution by question, showing the result for 4 representative questions out of the 11 total. The plot shows that the bandit is indeed learning a policy that differs substantially across users and questions. For only 3 of the 11 questions does the bandit determine that it is best to use the same action for every user—though even in these cases, each of the 3 questions has a different selected action.
5.2.2 Quality of learned solution
Next, we compare the expected reward of the learned policy from the bandit to that of the learned policy from the uniform random condition (Footnote 5). For the uniform random condition, we consider three different regression models: (1) a model with two-way interactions between actions and questions (effectively learning a constant policy per question); (2) a model with the same specification as the bandit, which is able to personalize based on pre-learning quiz score; and (3) a lasso regression that includes eight contextual covariates: pre-learning quiz score, accuracy on the previous question, time since starting the learning session, whether the user was shown the conceptual explanation and/or the isomorphic practice question for the previous concept, the time spent on the previous concept, and the speed at which the user set MathBot to send responses. Of these three models, the first performs the best.
We find that the bandit learned a policy which is comparable to the most successful policy learned from the uniform random condition, and further, that both the bandit and uniform random strategies learned a policy which outperformed the original policy from Studies 1 and 2 of always showing the concept without an isomorphic practice question. In Table 1, we display the average expected rewards of the two learned policies, along with the four constant policies that always use the same action. In particular, we find no statistically significant difference between the average reward obtained by the final bandit policy and the policy learned from the uniform random data. A 95% confidence interval for this difference in average rewards is [−11, 28], slightly in favor of the policy learned in the uniform random condition. Finally, we note one additional advantage of the contextual bandit, which is that it can continually refine its learned solution given additional users, whereas traditionally the pedagogical policy would be fixed after concluding the uniform random experiment.
5.2.3 The cost of exploration
The above results indicate that a contextual bandit can indeed learn a personalized pedagogical policy on par with one learned from uniform random data. The primary value of a bandit, however, is that it incurs lower costs of exploration by quickly learning which actions are unlikely to be beneficial. We thus now directly compare the average rewards obtained under the bandit and uniform random conditions during the model-learning period. Higher average reward during model-learning suggests users are having a better experience, as they receive sub-optimal actions less often.
We first compute the average reward for each lesson in both conditions, and then average that quantity over all the lessons in each condition. This gives us the average reward per lesson per user for both conditions. As shown in Fig. 11 (left panel), the average reward in the contextual bandit condition is substantially higher than in the uniform random condition. A 95% confidence interval on the difference is [9.6, 29.1] (Footnote 6). In Fig. 12 in the "Appendix" we plot the cumulative average of the total local reward per user as a function of the number of users in each condition, finding that the bandit quickly improves upon the uniform random policy in terms of average reward.
As another way to assess the cost of exploration, we compute the average value of a global reward function across users in our two conditions—bandit and uniform random. Analogous to the local reward function, the global reward is defined as:

$$\text{global reward} \;=\; 150 \cdot (\text{post-learning quiz score}) \;-\; \text{total seconds spent on the conversation}.$$
In contrast to the local reward function, the global reward considers the total post-learning quiz score and total time spent on the entire MathBot conversation, rather than correctness and time spent during individual lessons.
Figure 11 (right panel) shows the average global rewards of participants between the two conditions. We find that the bandit obtains considerably higher average global rewards than the uniform random condition, with the difference being 171 (95% CI [18, 324], p = 0.029). We note further that the difference is mostly driven by users in the bandit condition taking far less time to finish the MathBot conversation. Table 15 breaks down the average learning gains and lesson times for users in the two conditions. The average difference in lesson times is 149 seconds (95% CI [32, 266], p = 0.012), translating to the bandit saving around 12% of the uniform policy’s lesson time, while users in both conditions scored roughly the same on the post-learning quiz, with the difference being 0.2 questions (95% CI [−0.53, 0.93], p = 0.59), slightly in favor of the bandit. The bandit was only designed to optimize for local rewards, so this result offers further evidence that the bandit is learning a generally effective policy.
As a final way to assess user satisfaction during exploration, we examine the difference in dropout rates between the two conditions. A user is said to “drop out” if they complete the first MathBot lesson but not the final lesson, either skipping to the post-learning quiz or leaving the experiment. Out of the participants in the bandit condition, 9% dropped out, compared to 15% of participants in the uniform random condition—a statistically significant gap of 6 percentage points (two-proportion z-test, p < 0.05). This result again suggests the bandit provides an improved user experience while learning a pedagogical policy.
6 Discussion
6.1 Limitations
One potential shortcoming of MathBot and similar conversational tutoring systems is the time needed to develop and test the underlying conversation graph. On the other hand, since MathBot does not require researchers to develop NLP algorithms and models for conversation, it has one of the strengths of example-tracing tutors: those without extensive machine-learning expertise, including high-school instructors, could feasibly participate in development. In addition, Study 3 demonstrated the successful use of contextual bandit algorithms to learn how to personalize elements of the conversation graph—specifically, when to skip conceptual explanations and when to give additional practice problems. This result provides one demonstration of how such components could be learned via a data-driven process after deployment, further minimizing the development time.
It is worth discussing whether our success in using a contextual bandit to learn a pedagogical policy might generalize to other learning scenarios. One of the main theoretical concerns of using a contextual bandit in learning scenarios is that it may not be able to optimally handle long-term dependencies (e.g., skipping the first conceptual explanation hurts performance on the eighth concept). Much work has thus explored more complicated approaches to learning personalized pedagogical strategies, which require more data (Chi et al., 2011; Ruan et al., 2019). We point out two features of our setting that are encouraging in this respect: (1) the lesson contains many concepts, most of which build upon one another, and (2) our bandit, despite being designed with a local reward function, was still able to learn more effectively than a uniform random policy even when evaluated with a global reward function. These two points of evidence suggest that bandits, despite theoretical concerns, may still have value in learning pedagogical policies even in complex and path-dependent learning scenarios.
An important limitation of our study is that we evaluated MathBot using a convenience sample of adults from Amazon Mechanical Turk. While Mechanical Turk workers have been shown to exhibit similar video-watching behavior and quiz performance as MOOC learners (Davis et al., 2018), it would be valuable to test our system with a population actively exposed to algebra instruction, such as high school students or remedial adult learners in college. Our study also does not address the implications of using MathBot as a major component of a full-length course. For example, we did not investigate knowledge retention, and we do not know whether students would enjoy using MathBot less or more if they used it to learn over the course of several weeks or months. One potential upside of using MathBot over a longer period of time is that if students changed in their aptitude, MathBot would automatically adjust its lessons to that student. Since MathBot’s applicability to a classroom setting is yet to be explored, future work could consider how this approach would be received and used by teachers. For example, would MathBot be most useful as homework, as an optional supplementary resource, or as in-class practice?
Additionally, our system taught a single algebra topic, arithmetic sequences, with a conversation intended to last approximately 30 min (Studies 2 and 3) or under 10 min (Study 1). Furthermore, because Khan Academy is an independent platform, we were unable to deeply investigate the video-watching and tutorial-completing behavior of participants in Studies 1 and 2. Further investigation is necessary to understand exactly which of our insights might generalize to other learning scenarios, including longer interaction periods, different topics in mathematics, and different learning formats such as games (Lee et al., 2014).
6.2 Conclusion
In this work, we developed and studied the effect of an interactive math tutoring system: MathBot. Although the content of MathBot closely matched that of the Khan Academy materials, we found evidence of heterogeneous learning preferences. MathBot produced learning gains that were somewhat higher than those of Khan Academy, though the gap was not statistically significant. Finally, we found that a contextual bandit was able to efficiently learn a personalized pedagogical policy for showing extra practice problems and skipping explanations to appropriately alter the pace of the MathBot conversation, outperforming a randomized experiment.
We note several directions for further work. We found that the bandit was able to learn as effective a policy as the randomized A/B experiment while requiring less time; however, we did not study what might happen in a setting where the time of the lesson was fixed and the bandit instead had to optimize learning gains given the fixed time allotted for the lesson. Additionally, given the challenge of fully exploring a substantial number of actions and a sizable context space with only a limited number of interactions with real users, the contextual bandit used in Study 3 had access to only four actions and one contextual variable. If a future iteration of MathBot were released to a larger audience, the bandit could explore additional actions, such as entirely skipping topics or providing more than one additional practice question, and could leverage additional contextual variables, such as users’ stated preferences for learning via conceptual explanations versus example problems, or individual pre-quiz answers. Furthermore, the choice of learning media itself could be personalized with either a contextual bandit or another technique from the reinforcement learning literature: one could certainly imagine specific students or concepts being better suited for conversation than video or vice versa.
Several users in Study 1 noted the benefit of interacting with multiple learning modules, and past work has demonstrated that prompting users with relevant questions periodically during a video may improve learning outcomes (Shin et al., 2018). Accordingly, one could explore integrating brief conversations with MathBot into educational videos or, conversely, video elements could be used in the MathBot conversation. Though MathBot interactively guides learners through explanations and relevant questions, it does not provide a platform for extensive rote practice after finishing the conversation. An adaptive question sequencing model such as DASH [29] could be used to guide students through an optimized sequence of practice problems by accounting for student performance during the MathBot conversation. We hope that future work will investigate the potential of intelligent tutoring systems that incorporate multiple modes of teaching and learn to personalize themselves to individual student needs.
Notes
1. Additional details on study design are listed in the appendix, including criteria for user filtering.
2. Additional details on study design are listed in the appendix.
3. To train the bandit, we used a linear model with Thompson sampling, a technique known to have strong empirical performance and theoretical guarantees (Agrawal & Goyal, 2013). We model the reward using ordinary least squares (OLS) regression where the covariates are the contextual variables, the actions, and the two-way interaction terms between the contextual variables and the actions. Then, we simply choose action a with probability proportional to its posterior likelihood of being the best action. See "Appendix" for more details.
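For illustration, the following sketch implements Thompson sampling of this general shape. It uses a conjugate Bayesian linear model (a ridge-style stand-in for the OLS fit described above, since Thompson sampling requires a posterior over coefficients); the prior and noise variances and the exact feature encoding are illustrative assumptions, not the study's actual hyperparameters.

```python
import numpy as np

N_ACTIONS = 4                 # the four pacing actions of Section 5
DIM = 2 + 2 * N_ACTIONS       # intercept, context, action dummies, interactions

def featurize(context: float, action: int) -> np.ndarray:
    """Intercept, context (pre-quiz score), action indicators, and
    context-action interaction terms, per the model in this footnote."""
    ind = np.zeros(N_ACTIONS)
    ind[action] = 1.0
    return np.concatenate(([1.0, context], ind, context * ind))

class LinearThompsonSampler:
    """Thompson sampling with a Bayesian linear reward model (a sketch)."""

    def __init__(self, prior_var: float = 100.0, noise_var: float = 1.0):
        self.precision = np.eye(DIM) / prior_var  # posterior precision
        self.b = np.zeros(DIM)                    # accumulated X^T y / noise_var
        self.noise_var = noise_var

    def select_action(self, context: float) -> int:
        """Draw one coefficient vector from the posterior and act greedily,
        which selects each action with its posterior probability of being best."""
        cov = np.linalg.inv(self.precision)
        theta = np.random.multivariate_normal(cov @ self.b, cov)
        return int(np.argmax([featurize(context, a) @ theta
                              for a in range(N_ACTIONS)]))

    def update(self, context: float, action: int, reward: float) -> None:
        """Standard Bayesian linear-regression posterior update."""
        x = featurize(context, action)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * reward / self.noise_var
```

On each lesson, one would call select_action with the user's pre-learning quiz score, observe the local reward once the assessment question is answered, and call update.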
4. These numbers do not represent the distribution of actions actually assigned during the experiment, but rather the distribution of actions under the policy the bandit ultimately learned.
5. We never observe the actual outcomes of implementing these policies, as that would require running another costly experiment. We instead make use of standard offline policy evaluation techniques to compare the pedagogical strategies learned by the bandit and the uniform random experiment (Li et al., 2012). The specific quantity we are interested in is the expected average reward for a random user drawn from the population distribution. To compute the expected average reward of a policy on question i, we evaluate the average reward on question i among users in our uniform random condition who were randomly assigned the action the policy would have chosen. We then average these rewards over all the questions and compute standard errors by bootstrapping the uniform random data. We follow a similar procedure for the policies trained on the uniform random data, except we choose an action for person p using a model trained on all the uniform random data except those of person p to avoid overfitting.
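A minimal sketch of this replay-style estimator is given below; the record fields and function signature are illustrative assumptions rather than the study's actual code.

```python
import numpy as np

def expected_policy_reward(logs, policy, n_questions):
    """Replay-style offline estimate of a policy's average per-question reward.

    logs: records collected under uniform random action assignment; each is a
      dict with (hypothetical) keys 'question', 'context', 'action', 'reward'.
    policy: function (question, context) -> action.
    Assumes every (question, policy-action) cell is non-empty in the logs.
    """
    per_question = []
    for q in range(n_questions):
        # Keep only records where the logged action matches the policy's choice;
        # uniform randomization makes this subset an unbiased sample.
        matched = [r["reward"] for r in logs
                   if r["question"] == q
                   and r["action"] == policy(q, r["context"])]
        per_question.append(np.mean(matched))
    # Bootstrap over the logs to obtain standard errors, as described above.
    return float(np.mean(per_question))
```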
6. We compute the standard errors for our uniform random condition through the bootstrap. The standard errors for the bandit condition are obtained by fitting a response surface model on the uniform random data and running simulations.
References
Agrawal, S., & Goyal, N. (2013, May). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127–135). PMLR.
Al-Rfou, R., Pickett, M., Snaider, J., Sung, Y.-H., Strope, B., & Kurzweil, R. (2016). Conversational contextual cues: The case of personalization and history for response ranking. arXiv preprint arXiv:1606.00372
Aleven, V., McLaren, B. M., & Sewall, J. (2009a). Scaling up programming by demonstration for intelligent tutoring systems development: An open-access web site for middle school mathematics learning. IEEE Transactions on Learning Technologies, 2(2), 64–78
Aleven, V., McLaren, B. M., Sewall, J., & Koedinger, K. R. (2009b). A new paradigm for intelligent tutoring systems: Example-tracing tutors. Technical report
Aleven, V., McLaren, B. M., Sewall, J., van Velsen, M., Popescu, O., Demi, S., Ringenberg, M., & Koedinger, K. R. (2016). Example-tracing tutors: Intelligent tutor development for non-programmers. International Journal of Artificial Intelligence in Education, 26(1), 224–269
Andrews, P., De Boni, M., Manandhar, S., & De, M. (2006) Persuasive argumentation in human computer dialogue. In AAAI spring symposium: Argumentation for consumers of healthcare (pp. 8–1)
Bala, K., Kumar, M., Hulawale, S., & Pandita, S. (2017). Chat-bot for college management system using AI. International Research Journal of Engineering and Technology, 4(11), 2030–2033.
Bobrow, D. G., & Winograd, T. (1977). An overview of KRL, a knowledge representation language. Cognitive Science, 1(1), 3–46
Chi, M., VanLehn, K., Litman, D., & Jordan, P. (2011). Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21(1), 137–180.
Chu-Carroll, J., & Brown, M. K. (1997). Tracking initiative in collaborative dialogue interactions. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics (pp. 262–270). Association for Computational Linguistics
Clement, B., Oudeyer, P.-Y., Roy, D., & Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining, 7(2), 20–48
Craig, S. D., Hu, X., Graesser, A. C., Bargagliotti, A. E., Sterbinsky, A., Cheney, K. R., Okwumabua, T. & Cheney, S. (2013). The impact of a technology-based mathematics after-school program using ALEKS on student’s knowledge and behaviors. Computers & Education, 68, 495–504
Davis, D., Hauff, C., & Houben, G.-J. (2018) Evaluating crowdworkers as a proxy for online learners in video-based learning contexts. In Proceedings of the ACM on human-computer interaction (pp. 42:1–42:16). ACM
Falmagne, J.-C., Albert, D., Doble, C., Eppstein, D., & Hu, X. (2013). Knowledge spaces: Applications in education. Springer Science & Business Media
Feng, M., Heffernan, N., & Koedinger, K. (2009). Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19(3), 243–266
Graesser, A. C., Lu, S., Jackson, G. T., Mitchell, H. H., Ventura, M., Olney, A., & Louwerse, M. M. (2004). AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 36(2), 180–192
Graesser, A. C., Penumatsa, P., Ventura, M., Cai, Z., & Hu, X. (2007). Using LSA in AutoTutor: Learning through mixed initiative dialogue in natural language. In Handbook of latent semantic analysis (pp. 243–262)
Graesser, A. C., Person, N. K., & Magliano, J. P. (1995). Collaborative dialogue patterns in naturalistic one-to-one tutoring. Applied Cognitive Psychology, 9(6), 495–522
Graesser, A. C., VanLehn, K., Rosé, C. P., Jordan, P. W., & Harter, D. (2001). Intelligent tutoring systems with conversational dialogue. AI magazine, 22(4), 39
Graesser, A. C., Wiemer-Hastings, K., Wiemer-Hastings, P., & Kreuz, R. (1999). AutoTutor: A simulation of a human tutor. Cognitive Systems Research, 1(1), 35–51
Heffernan, N. T., & Heffernan, C. L. (2014). The ASSISTments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24(4), 470–497
Horzyk, A., Magierski, S., & Miklaszewski, G. (2009). An intelligent internet shop-assistant recognizing a customer personality for improving man-machine interactions. Recent Advances in intelligent information systems, 13–26
Koedinger, K. R., Aleven, V., Heffernan, N., Mclaren, B., & Hockenberry, M. (2004). Opening the door to non-programmers: Authoring intelligent tutor behavior by demonstration. Technical report
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1), 4–22
Lan, A. S., & Baraniuk, R. G. (2016). A contextual bandits framework for personalized learning action selection. In Proceedings of the 9th international conference on educational data mining (pp. 424–429)
Lee, S. J., Liu, Y.-E., & Popovic, Z. (2014) Learning individual behavior in an educational game: A data-driven approach. In Proceedings of the 7th international conference on educational data mining (EDM) (pp. 114–121)
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. WWW, 2010, 661–670
Li, L., Chu, W., Langford, J., & Wang, X. (2011, February). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 297–306).
Nye, B. D., Graesser, A. C., & Hu, X. (2014). AutoTutor and family: A review of 17 years of natural language tutoring. International Journal of Artificial Intelligence in Education, 24(4), 427–469
Nye, B. D., Pavlik, P. I., Windsor, A., Olney, A. M., Hajeer, M., & Hu, X. (2018). SKOPE-IT (Shareable Knowledge Objects as Portable Intelligent Tutors): overlaying natural language tutoring on an adaptive learning system for mathematics. International Journal of STEM Education, 5(1), 12
O'Rourke, E., Andersen, E., Gulwani, S., & Popović, Z. (2015, April). A framework for automatically generating interactive instructional scaffolding. In Proceedings of the 33rd annual ACM conference on human factors in computing systems (pp. 1545–1554).
Person, N. K. (2003). AutoTutor improves deep learning of computer literacy: Is it the dialog or the talking head? Artificial intelligence in education: Shaping the future of learning through intelligent technologies, 97, 47
Quarteroni, S., & Manandhar, S. (2007). A chatbot-based interactive question answering system. Decalog 2007, 83
Raux, A., & Eskenazi, M. (2009) A finite-state turn-taking model for spoken dialog systems. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics (pp. 629–637). Association for Computational Linguistics
Ritter, S., Anderson, J. R., Koedinger, K. R., & Corbett, A. (2007). Cognitive tutor: Applied research in mathematics education. Psychonomic Bulletin & Review, 14(2), 249–255
Ruan, S., Jiang, L., Xu, J., Tham, B. J. K., Qiu, Z., Zhu, Y., … Landay, J. A. (2019, May). QuizBot: A dialogue-based adaptive learning system for factual knowledge. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–13).
Segal, A., David, Y. B., Williams, J. J., Gal, K., & Shalom, Y. (2018, June). Combining difficulty ranking with multi-armed bandits to sequence educational content. In International conference on artificial intelligence in education (pp. 317–321). Cham: Springer.
Seneff, S. (1992). TINA: A natural language system for spoken language applications. Computational Linguistics, 18(1), 61–86
Shin, H., Ko, E.-Y., Williams, J. J., & Kim, J. (2018). Understanding the effect of in-video prompting on learners and instructors. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (p. 319). ACM
Snow, R. E. (1989). Aptitude-treatment interaction as a framework for research on individual differences in learning. In P. L. Ackerman, R. J. Sternberg, & R. Glaser (Eds.), Learning and individual differences: Advances in theory and research (pp. 13–59). W. H. Freeman
VanLehn, K. (1996). Conceptual and meta learning during coached problem solving. In International conference on intelligent tutoring systems (pp. 29–47). Springer
VanLehn, K., Jordan, P. W., Rosé, C. P., Bhembe, D., Böttner, M., Gaydos, A., Makatchev, M., Pappuswamy, U., Ringenberg, M., Roque, A., Siler, S., & Srivastava, R. (2002). The architecture of why2-atlas: A coach for qualitative physics essay writing. In International conference on intelligent tutoring systems (pp. 158–167). Springer
Walker, M., & Whittaker, S. (1990) Mixed initiative in dialogue: An investigation into discourse segmentation. In Proceedings of the 28th annual meeting on association for computational linguistics (pp. 70–78). Association for Computational Linguistics
Weeraratne, B., & Chin, B. (2018). Can khan academy e-learning video tutorials improve mathematics achievement in Srilanka? International Journal of Education and Development Using Information and Communication Technology, 14(3), 93–112
Winkler, R., Hobert, S., Salovaara, A., Söllner, M., & Leimeister, J. M. (2020). Sara, the lecturer: Improving learning in online education with a scaffolding-based conversational agent. In Proceedings of the 2020 CHI conference on human factors in computing systems (pp. 1–14)
Xu, A., Liu, Z., Guo, Y., Sinha, V., & Akkiraju, R. (2017). A new chatbot for customer service on social media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 3506–3510). ACM.
Yan, R., Song, Y., & Wu, H. (2016). Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval (pp. 55–64). ACM.
Zhou, G., Wang, J., Lynch, C., & Chi, M. (2017). Towards closing the loop: Bridging machine-induced pedagogical policies to learning theories. In Proceedings of the 10th international conference on educational data mining (pp. 112–119).
Acknowledgements
We thank Keith Shubeck, Carol Forsyth, Ben Nye, Weiwen Leung, Ro Replan, and Sam Maldonado for helpful comments and discussions. This work was supported by the Office of Naval Research.
Appendix
1.1 Study design details
Study 1 Our study was conducted on Amazon Mechanical Turk and was restricted to adults in the United States. To qualify, participants had to pass two screening quizzes. The first was a brief five-question quiz verifying that participants had sufficient algebra knowledge to understand sequences but did not already have advanced knowledge of arithmetic sequences. The second consisted of a more in-depth set of 12 questions selected from a Khan Academy quiz on arithmetic sequences; we excluded participants who answered more than 50% of these questions correctly, reasoning that they already had substantial knowledge of sequences. Users were paid a bonus proportional to their score on a post-learning quiz. This performance-based payment scheme was disclosed to participants at the start of the study to incentivize active engagement with MathBot, attentive watching of the Khan Academy video, and dutiful completion of the written tutorial. Finally, we excluded participants who spent less than one minute on either MathBot or the Khan Academy learning module, reasoning that they did not seriously engage with the material.
Study 2 Users assigned to Khan Academy had access to seven videos and four written tutorials with embedded practice problems, and they were informed that completing either the videos or the tutorials would sufficiently prepare them for the post-learning quiz. Users were incentivized to complete the learning module to the best of their ability with a bonus payment proportional to their performance on the post-learning quiz.
1.2 Additional qualitative results, Study 1
Self-pacing versus guidance In the first part of the study, 8 out of 116 users noted the benefits of freely navigating the video: “I can rewind them and fast forward if I already know the concept.” Similarly, 22 out of 111 users in the second part of the study indicated value in freely scrolling through the written tutorial. These users frequently indicated frustration with the inability to freely navigate the material in the MathBot conversation.
On the other hand, 6 users in the first part of the study appreciated that MathBot, unlike the video, adapted its pace to their progression through concepts and questions. Similar sentiments were echoed by 15 users in the second part of the study, who appreciated that MathBot explicitly guided them through concepts, unlike the written tutorial. Furthermore, 8 users in the first part of the study valued being able to scroll through earlier parts of the MathBot conversation to review concepts.
Human elements and interactivity In the first part of the study, 7 out of 116 users found MathBot to be more agentic than the video. However, 9 users reported the opposite: “Even though it was a video, it felt like a more personal experience because it was a human voice talking versus just reading on the screen.” In the second part of the study, 12 out of 111 users indicated that MathBot provided a greater sense of interaction than the written tutorial.
Requiring users to evaluate their knowledge The video asked users to pause and think about problems, but, unlike MathBot, it did not require users to answer those problems correctly before continuing. 22 out of 116 users in the first part of the study noted the value of MathBot holding them accountable for understanding concepts before progressing: “When watching the video, I wasn’t sure if I was actually understanding the concepts correctly.” Similarly, although the Khan Academy tutorial interspersed practice problems with the text, users could easily skip them, and 21 out of 111 users in the second part of the study found that being held accountable aided their learning. 10 learners also valued that MathBot provided more specific feedback on their answers than the tutorial did.
Combining learning modules 42 users in the first part of the study and 57 users in the second part of the study suggested that both tools could be particularly valuable in specific learning scenarios. For example, 8 users in the first part of the study thought the video was superior for learning concepts, whereas MathBot was better for learning how to apply those concepts: “The best option for me would be to watch the video first, and then take part in the conversational computer program so that I could verify my understanding.” Similarly, 16 users indicated that, like the video, the written tutorial introduced concepts more effectively. 25 users in the first part of the study and 15 users in the second part of the study found that videos and written tutorials (respectively) were superior for learning concepts that required complex or detailed explanations.
1.3 Description of Thompson sampling
As described in the text, we use a linear model with Thompson sampling to train the contextual bandit. Each time a user answers a question and we obtain a new data point, we refit a linear regression on all previously recorded contexts, actions, and rewards. The regression contains all of the interactions between the context (a question identifier denoted by the indicator \(1_j\), which is 1 if the user is answering question j and 0 otherwise, and the pre-quiz score p) and the actions (whether an isomorph was shown, denoted by the indicator \(1_i\), and whether the explanation of the concept was skipped, denoted by the indicator \(1_s\)). The reward is denoted by r.
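To make the covariate construction concrete, the sketch below assembles the covariate vector from the context and action indicators described above. The helper name, the number of questions, and the inclusion of main effects alongside the context-action interactions are our illustrative assumptions rather than details of MathBot's implementation.

```python
import numpy as np

NUM_QUESTIONS = 10  # hypothetical count of questions where the bandit acts

def make_covariates(question_id, prequiz_score, show_isomorph, skip_explanation):
    """Build the covariate vector x from context (1_j, p) and actions (1_i, 1_s)."""
    context = np.zeros(NUM_QUESTIONS + 1)
    context[question_id] = 1.0          # question indicator 1_j
    context[-1] = prequiz_score         # pre-quiz score p
    actions = np.array([show_isomorph, skip_explanation], dtype=float)  # 1_i, 1_s
    # All interactions between context features and action indicators.
    interactions = np.outer(context, actions).ravel()
    return np.concatenate([context, actions, interactions])
```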
Given the results of the regression, we have a posterior distribution over the coefficients, which is multivariate normal. This distribution is
\[
\beta \sim \mathcal{N}\!\left(\hat{\beta},\ \hat{\sigma}^2 \left(X^\top X\right)^{-1}\right),
\]
where \(X\) is the design matrix of previously recorded covariates, \(\hat{\beta}\) is the least-squares estimate of the coefficients, and \(\hat{\sigma}^2\) is the estimated residual variance.
We sample a random \(\beta _t\) from this distribution. Then, we iterate over every possible action \(a\) and choose the action that maximizes the predicted reward given \(\beta _t\):
\[
a_t = \arg\max_{a}\; x(c, a)^\top \beta _t,
\]
where \(c\) denotes the contextual variables and \(x(c, a)\) is the covariate vector, which is computed from the contextual variables and the action terms, as described previously. This action is then given to the learner, and the resulting reward is recorded (full results are reported in Tables 2–15 and Fig. 12).
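For illustration, the following sketch implements one round of this procedure with NumPy, reusing the hypothetical make_covariates helper from the sketch above. The small ridge term and the residual-variance estimate are our assumptions, added so the posterior is well defined when few observations are available, and enumerating all four combinations of the two action indicators is a simplification of the actual decision points.

```python
import numpy as np
from itertools import product

def thompson_step(X, r, question_id, prequiz_score, rng):
    """One Thompson sampling round: fit the regression, sample beta_t,
    and return the action pair with the highest predicted reward.

    X : (n, d) array of covariates from all previous interactions
    r : (n,)   array of observed rewards
    """
    n, d = X.shape
    # Posterior over coefficients: N(beta_hat, sigma2 * (X'X)^{-1}).
    # The 1e-3 ridge term (our assumption) keeps the inverse well defined.
    cov_unscaled = np.linalg.inv(X.T @ X + 1e-3 * np.eye(d))
    beta_hat = cov_unscaled @ (X.T @ r)
    residuals = r - X @ beta_hat
    sigma2 = float(residuals @ residuals) / max(n - d, 1)  # residual variance
    # Draw a random coefficient vector beta_t from the posterior.
    beta_t = rng.multivariate_normal(beta_hat, sigma2 * cov_unscaled)
    # Iterate over the possible (1_i, 1_s) pairs and keep the best one.
    best_action, best_value = None, -np.inf
    for show_isomorph, skip_explanation in product((0, 1), repeat=2):
        x = make_covariates(question_id, prequiz_score,
                            show_isomorph, skip_explanation)
        value = float(x @ beta_t)
        if value > best_value:
            best_action, best_value = (show_isomorph, skip_explanation), value
    return best_action

# Usage: after the learner responds, append the chosen x and observed reward
# to X and r, then call thompson_step again at the next decision point, e.g.:
# rng = np.random.default_rng(0)
# action = thompson_step(X, r, question_id=3, prequiz_score=0.6, rng=rng)
```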