Introduction

For artificial intelligence (AI) to enter the mainstream of educational technology and become a standard element of learning environments, design methodologies are needed that enable instructional designers to efficiently create AI-driven learning products. The most commonly used instructional design methodology, and the one that most instructional designers are trained to use, is ADDIE (analyze, design, develop, implement, and evaluate) (Branson et al. 1975; Branch 2009). ADDIE is used on many projects because it is relatively easy to manage and yields predictable results. However, ADDIE and other commonly used instructional design methodologies do not address the needs and potential of AI-driven learning systems that rely heavily on data.

In ADDIE’s original conception the evaluation (E) phase occurs mainly at the end, after implementation is complete. Nowadays ADDIE practitioners try to start formative evaluations earlier in the process; however, it can be difficult to collect formative evaluation data under conditions that resemble actual use. Formative evaluations are often conducted in the laboratory with individuals or focus groups (Treser 2015). In our experience at Alelo, systems are frequently tested by customer representatives who are not the intended users. Once a system is deployed on the customer’s own computing infrastructure, the designers may have little or no access to data from the system in use.

Cloud computing can fundamentally change the role of learner data in instructional design. As learners access learning systems they provide a constant source of data both before and after system delivery. Cloud-based systems scale up easily, capturing unprecedented amounts of learner data. AI-driven learning environments are well positioned to take advantage of this data source.

This article describes an approach to the design and evaluation of intelligent learning environments, known as data-driven development (D3). In the D3 approach learner data inform design from the very beginning and throughout the product life cycle. Evaluation takes place continually. Development iterates rapidly, limited by the time it takes to analyze and interpret data, retrain models, and deploy updates. D3 is particularly relevant to fielded AIED systems, where learners may behave in unanticipated ways due to the environment in which they learn.

The article then presents two snapshot evaluations of Enskill® English, an AI-driven system for learning to speak English as a foreign language. Students at the University of Novi Sad in Serbia and the University of Split in Croatia participated in the evaluations. The collected data were used in multiple ways, in what would be distinct phases in the ADDIE model but which are combined in the D3 approach. Data from learners in Latin America and Europe were analyzed for comparison. The evaluations provided preliminary evidence that Enskill English is helpful for learning spoken English skills, and leading indicators that practice with Enskill English results in improved speaking skills. The article then provides broader recommendations for the use of the D3 methodology.

The Data-Driven Development Methodology

The D3 model arose out of the recognition that constant access to data was transforming our approach to instructional design, so that it no longer conformed to the conventional ADDIE model.

Instructional design in D3 is informed by learner behavior data throughout the product life cycle. Available learner data are captured, inspected, and mined as early as possible in the process, preferably in the initial needs analysis phase before development begins. This helps to identify specific learner behavior characteristics that should be addressed by the product. Data mining and analysis confirms hypotheses about learner needs derived from SME (subject matter expert) interviews, identifies additional needs which the SME interviews failed to uncover, and helps prioritize development based on the frequency and severity of learner problems identified in the data.

This raises the question of how to obtain learner data before a learning system is developed. One way is to use archived corpora collected by other systems from learners similar to the target learner profile. Organizations such as the PSLC DataShop (PSLC DataShop 2012) and the Linguistic Data Consortium (Linguistic Data Consortium 2018) maintain corpora that can be used to inform instructional design. Companies record customer conversations for training purposes. Developers of learning systems that capture learner data sometimes make them available in anonymized form for research purposes. As more learning systems run in the cloud, more sources of learner data may become available that can be used to kickstart the development of new learning systems.

Another approach is to build an initial partial prototype or Wizard-of-Oz mockup and use it to gather data to kickstart the design process. This was the approach used to initiate the development of Enskill English. An initial version was deployed to collect data from testers and trial users. We now use Enskill English to collect data from new groups of users, which informs further development.

Figure 1 illustrates the D3 development cycle. It consists of three activities: Data mining and analysis, Development and model updates, and Deployment. In the Data mining and analysis phase, data are analyzed to gain insight regarding learner needs. The Development phase implements or updates the learning system, informed by these insights. It often uses machine learning to train models that are used in the running system. Then in the Deployment phase learners use the updated system, which generates more learner data so that the cycle can continue.

Fig. 1 The D3 development cycle

Once the system is running, data are collected and analyzed continually to identify system components that need updating, and to obtain the training data necessary to perform the updates. In the case of Enskill English this analysis takes place on a weekly basis. As we extend Enskill English to new types of learners, we first collect data from those learners to understand how their needs differ from those of learners who have used the system in the past.

D3 is in some ways similar to the SAM (Successive Approximation Model) method of instructional design (Allen 2012) in that it applies principles of agile software development (Beck et al. 2001), is highly iterative, and values working software at each iteration. Other methodologies for intelligent tutoring system development, such as the Design Implementation Framework (Stone et al. 2018), are also iterative. D3 goes a step further by valuing not just working software but also data at each iteration. It requires a rethinking of the role of data in the system lifecycle. In the past, if systems generated data at all, the data were considered a byproduct and discarded. In the D3 approach data instead become an important input, driver, and result of system development.

D3 supports iterative educational research in real educational settings, similar to educational design research (McKenney and Reeves 2018). The data collected in each iterative development cycle are used to test hypotheses about the learning solution and to develop new hypotheses for testing in subsequent iterative cycles.

Some iterative, data-driven development efforts have been described in the literature. Amershi and Conati (2011) describe a two-stage process in which learner data are first analyzed offline to identify clusters of students, and those clusters are then used online to classify learners. Fournier-Viger et al. (2011) describe a two-stage process in which learner interaction data are mined for patterns, and the tutoring system then recognizes those patterns in the interactions of other students. Koedinger et al. (2013) describe data-driven optimization of intelligent tutoring systems, but not iterative evaluation of successive versions with learners. More typically, projects that engage in educational data mining (EDM) report results from the completed system and do not describe how data mining informed iterative development either before or after deployment. Describing the process is valuable so that practitioners can better understand how to implement it and transition to a more data-driven methodology.

Overview of Enskill English

Enskill English illustrates how development and evaluation take place in the D3 development cycle. It is an immersive learning environment for developing spoken proficiency in English as a foreign language. It runs on the Alelo Enskill learning platform, a cloud-based AI-driven platform for developing communication skills. Enskill English is a finalist for the British Council’s 2019 award for digital innovation in English language teaching (British Council 2019).

Of the four main language skills—speaking, listening, reading, and writing—speaking is by far the most important (Thiriau 2017). Yet English language learners (ELLs) around the world struggle to develop their speaking and listening skills. They have limited opportunity to practice with native English speakers, and must practice with their English teacher or with other learners. The problem is particularly acute for large classes where speaking-practice activities are hard to organize and manage.

Enskill English gives learners opportunities for realistic spoken English practice with animated characters that speak and understand English (Fig. 2). They can practice as much as they want in a safe environment, which builds proficiency and self-confidence. It also can reduce anxiety about speaking, which affects foreign language achievement and student attrition (Bailey et al. 2003; Hsieh 2008).

Fig. 2 Interaction with Enskill English

Enskill English’s AI serves as an aide intelligente, not as an artificial instructor. It is intended to act as an intelligent teacher’s aide that gives each learner individual attention while saving teachers time and effort so that they can teach more effectively. It automatically provides feedback and personalized instruction in the areas where learners need to improve. In future versions teachers will receive analytics about their students’ progress, effort, and performance, based in part on the analytics presented in this article, so that they can adapt their instruction to the needs of their students.

Learners access Enskill English through a web browser on their computer or mobile device. They converse with interactive characters by speaking into a microphone. The on-screen character interprets the learner’s speech and responds, and at the same time evaluates the learner’s communication skills. At the end of the exercise Enskill provides feedback and recommends exercises for further practice.

Enskill English is structured as a collection of task-based simulation modules. In each simulation the learner has a task to perform, such as buying a train ticket or getting directions to a destination. Simulations are organized into proficiency levels. The Common European Framework of Reference (CEFR) defines three levels of language proficiency: A (basic user), B (independent user), and C (proficient user). Each level is further subdivided into two sublevels: A1, A2, B1, B2, C1, and C2. At the time of this writing Enskill English simulations are available for the A1 and A2 levels, with more under development. Each level covers a semester-length course in English as a foreign language.

Each simulation is aligned with one or more CEFR can-do statements (ALTE 2002). One can-do statement at the A2 level is “can ask for basic information about travel and buy tickets.” In the corresponding Enskill English simulation the learner asks for information about trains and purchases a ticket. When learners master a simulation it shows they really can do what the can-do statement requires. Previously can-do statements served mainly as subjective self-assessments (ACTFL 2015).

Figure 3 shows an Enskill English simulation in which the learner’s task is to buy a train ticket to New York. The ticket agent has asked the learner where he is traveling to, and so the learner should tell her where he wants to go. Communicative functions such as telling someone where they want to go can be expressed in many different ways. Enskill provides each learner the minimum prompting necessary, and utilizes natural language understanding so that it can interpret a wide range of responses. This contrasts with other language learning products (e.g., Rosetta Stone, DynEd, Babbel, etc.) where learners select from a list of pre-authored choices and either read them aloud or click on them. If the learner needs help Enskill will offer some suggested responses (Fig. 3 top right), but the learner is not limited to these options. The transcript and hints are displayed in this example for the purpose of illustration; learners are encouraged to practice without reliance on hints and transcripts.

Fig. 3 Beginning of the Enskill English train ticket simulation

Unlike chatbot applications designed for native speakers, Enskill English is designed to be usable by both language learners and language teachers. It must support both, because if it does not perform well for language teachers they will not recommend it to their students. To support language learners it must be highly tolerant of language errors in the context of conversation. Figure 4 shows an utterance with multiple errors, transcribed as “I want to leave at first my.” The learner has used the preposition “at” instead of “on”, and has used the word order “first May” instead of “May 1st”. The misrecognition of the word “May” as “my” is likely due to a mispronunciation of the “ay” vowel in the word “May” as /ai/. Enskill incorporates a library of common word recognition errors which aids in understanding.

Fig. 4 Examples of learner language errors

Another key difference from generic chatbot technology is the ability to assess the learner’s language skills and provide feedback and personalized instruction. After each simulation Enskill gives feedback about which task objectives the learner failed to satisfy. The feedback is provided after the dialogue has ended to avoid interrupting the conversation’s flow. Learners then have the option of trying the simulation again or completing practice exercises that focus on the language skills required to complete the unsatisfied objectives. Each level has a bank of exercises from which to choose.

Figure 5 shows one such exercise where the learner practices asking about train arrival times. This type of exercise is called an utterance formation exercise; the learner is prompted to produce a spoken utterance that conveys a particular meaning or intent. In these exercises there is typically no one right answer. For example, instead of saying “What time will the train arrive?” the learner can also say “When will the train arrive?” Several equivalent responses are accepted, but the learner’s response must match one of the expected responses exactly.

Fig. 5 An Enskill English practice exercise

The Enskill English instructional approach reflects the dual character of spoken proficiency as both a motor skill (the ability to pronounce words) and a complex cognitive skill (the ability to understand and produce language). According to Fitts and Posner (1967), motor skill learning progresses from the cognitive stage through the associative stage to the autonomous stage. Anderson (1982) generalized this to the acquisition of cognitive skill, and described a process of proceduralization and knowledge compilation. These learning processes require practice.

The learning methodology in Enskill English is influenced by the instructional design methodology of van Merriënboer (1997), who argues that complex cognitive skills are best learned through a combination of whole-task practice and part-task practice. Dialogue simulations provide whole-task practice, while practice exercises provide part-task practice.

Enskill English contrasts with common apps such as Babbel, Rosetta Stone, and Duolingo, which offer some opportunities to practice words and phrases but give learners few opportunities to practice speaking, that is, to construct and produce their own responses without relying on prompts. Supiki (http://linguacomm.com), a mobile phone app that supports simulated conversations, accepts nonsense inputs and does not provide feedback.

A number of computer-aided language learning systems focus on pronunciation feedback; for example see Engwall (2012). Learners are presented with sentences to read and are scored on their pronunciation. This treats language primarily as a motor skill and overlooks the key cognitive skills of listening comprehension and sentence production. Learners who focus on pronunciation but neglect these other skills will likely struggle in real-world conversation.

Enskill builds on previous work on immersive training systems for languages and cultures, including Tactical Language (Johnson 2010; Johnson et al. 2012) and the Virtual Cultural Awareness Trainers (VCATs) (Johnson et al. 2011). Evaluations of these systems (MCCLL 2008; Johnson 2015) have indicated that they help learners acquire skills that they apply when they engage with people in other countries. The evaluations of Enskill English in this article provide preliminary indications that it too is effective, although further investigation of effectiveness is still necessary.

Enskill also builds on research in conversation-based systems such as Subarashii (Bernstein et al. 1999), Mercurial (Xu and Seneff 2011), PONY (Lee et al. 2014), and HALEF (Evanini et al. 2017). These are promising prototypes but have not been applied at scale. Many previous systems (e.g., Subarashii, Tactical Language, PONY) match learner utterances against a library of pre-authored phrases, a technique that works well for the basic language spoken by beginner learners but does not extend well to the complex language typical of higher levels of language proficiency. A major goal of Enskill is to understand and respond to complex learner language, up to the CEFR B level. Some systems (e.g., Mercurial) use speech and language technology that is trained on native speakers; this can result in high word error rates on learner speech. Enskill’s speech and language technology is trained on native speakers but adapted for use by language learners. Previous systems have been tested on a small number of sample tasks; Enskill has been applied to entire course curricula.

Dialogue systems such as ITSPOKE (Litman and Silliman 2004), AutoTutor (Graesser et al. 2005), and BEETLE (Dzikovska et al. 2011) are becoming increasingly common in educational applications, and toolkits such as the Alexa Skills Kit make it ever easier to add dialogue capabilities to applications. But these lack the assessment and feedback capabilities that are essential for learning communication skills.

The Enskill Platform

Enskill English runs on the Enskill cloud-based learning platform for communication skills. Enskill is a multilingual platform configurable for a variety of native languages and target languages. It was recently evaluated with American learners of Modern Standard Arabic at the Army language school at Ft. Bragg, NC (Johnson et al. 2018). The Enskill DLE is also used in Alelo’s Virtual Cultural Awareness Trainers (VCATs) (Johnson et al. 2011), which are available for over 90 countries and have been used by over 200,000 learners to date.

Figure 6 shows the Enskill system architecture. The core of the system is the Enskill SimServer, which is responsible for delivering simulation content to learners, providing the computational support to run the simulations, collecting performance data, and generating analytics. Enskill SimServers are currently running in North America, South America, Europe, and Southeast Asia to support learners throughout the world. Additional servers can be added as needed to serve learners in specific countries. The Enskill SimServer can interoperate with learning management systems (LMS) and other digital learning products. It supports the Learning Tools Interoperability (LTI) standard; support for the Caliper standard is in development. Enskill supports single sign-on so that learners can log into their institution’s LMS and access Enskill simulations interleaved with other learning activities.

Fig. 6 The Enskill architecture

The Enskill DLE (Digital Learning Environment) is an HTML5-compatible runtime player that runs learning content in the learner’s Web browser and captures learner speech and interaction data. It is designed to run on any device and Web browser that supports microphone input, including Windows laptops, Firefox and Chrome on MacBooks, and Chrome on Android devices. We are also developing an immersive 3D interface in Unity for virtual-reality environments.

Authors create content for Enskill using the Enskill Builder. The current version of the Enskill Builder (Johnson and Valente 2008) is only available internally to Alelo personnel. Alelo intends to release a licensed version that enables customers to collaborate with Alelo to develop content, or develop content on their own.

The transition of Alelo technology to the cloud has accelerated innovation and made data-driven development possible. Earlier Alelo systems were delivered as self-contained learning products hosted on client computers and networks. This made access to learner data very difficult. Clients would not request updates so the systems tended to fall out of date. In contrast, the Enskill SimServer processes all learner speech data and automatically archives them for future analysis. When updates to Enskill are released into the cloud, learners get access to them immediately.

The Enskill SimServer uses cloud-based commercial speech and language processing services (sometimes referred to as cognitive services) to assist in the processing of learner responses. This allows Enskill to take advantage of the continuing improvements in such services. We adapt and extend these services using the data that we collect from language learners. When learners with heavily accented speech use Enskill, the speech recognizer often misrecognizes some words. For example, many ELLs have trouble pronouncing English short vowels, so the speech recognizer can misrecognize the word “cups” as “cops”. To compensate for this we have used the data we collect to create a substitution table of common mistranscriptions of learner language, which a partial matching algorithm uses to find the best interpretation of the learner’s utterance. For example, if the speech recognizer transcribes a learner’s utterance as “I want a leaf on May 1st”, the substitution table identifies “leaf” as a possible mispronunciation of “leave”, and so Enskill can understand that the learner meant to say that she wants to leave on May 1st. This substitution table continues to improve as Enskill English collects more examples of mispronunciations. This approach has the advantage that it can be applied in the context of realistic dialogue; unlike computer-aided pronunciation training algorithms (e.g., Arora et al. 2017) it does not require learners to read text prompts off the screen.
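To make the mechanism concrete, the following is a minimal sketch of a substitution-aware partial matcher. The table entries, intent labels, and library utterances are illustrative assumptions, not Alelo’s production data or code.

```python
# Minimal sketch of substitution-aware partial matching (hypothetical values and names).

# Common ASR mistranscriptions of learner speech, learned from collected data.
SUBSTITUTION_TABLE = {
    ("leaf", "leave"),   # "I want a leaf on May 1st" -> "I want to leave on May 1st"
    ("my", "may"),
    ("cops", "cups"),
}

def word_distance(hypothesis, reference, sub_discount=0.5):
    """Word-level edit distance; known misrecognitions get a reduced substitution cost."""
    h, r = hypothesis.lower().split(), reference.lower().split()
    dp = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = float(i)
    for j in range(len(r) + 1):
        dp[0][j] = float(j)
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            if h[i - 1] == r[j - 1]:
                sub_cost = 0.0
            elif (h[i - 1], r[j - 1]) in SUBSTITUTION_TABLE:
                sub_cost = sub_discount          # likely mispronunciation, cheap substitution
            else:
                sub_cost = 1.0
            dp[i][j] = min(dp[i - 1][j] + 1.0,   # deletion
                           dp[i][j - 1] + 1.0,   # insertion
                           dp[i - 1][j - 1] + sub_cost)
    return dp[len(h)][len(r)]

def best_interpretation(asr_transcript, utterance_library):
    """Return the (utterance, intent) pair in the library closest to the ASR output."""
    return min(utterance_library,
               key=lambda entry: word_distance(asr_transcript, entry[0]))

library = [("I want to leave on May 1st", "give_departure_date"),
           ("I would like a return ticket", "request_return_ticket")]
print(best_interpretation("I want a leaf on May 1st", library))
```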

The partial matching algorithm described above is an example of the data-driven development (D3) methodology at work. Data informed its initial design, and data informs its iterative development as well as its eventual replacement by new natural language understanding (NLU) algorithms.

Continuous Evaluation in D3

Although the learner data that D3 collects facilitate evaluation, rapid iterative development can make evaluation a challenge. The system being evaluated is a moving target. It is important for evaluations to provide actionable findings quickly, to inform development. D3 addresses this issue through a process of continuous evaluation.

D3 utilizes multiple types of evaluations. In what follows tests refer to evaluations that analyze only learner data, while evaluations in general may analyze other data sources such as surveys and interviews in addition to learner data. Tests rely heavily on automated data analysis, while evaluations in general may employ mixed-methods approaches.

At the lowest level in the evaluation process is a series of instant tests: evaluations of learner and system performance at the current point in time. For Enskill English instant tests are performed weekly, analyzing data that was captured over the preceding week. Instant tests are limited to analyses that can be performed automatically, supplemented by spot inspections of the incoming data. Analysis activities such as rating student answers, transcribing recordings, and tagging data require effort, coordination, and time, so it is difficult to make use of them in instant tests. Instant tests can however identify data sets that warrant follow-on analysis.

Additionally, periodic snapshot evaluations are performed over a limited period of time with a selected learner population. Data and analytics from instant tests of the target population are aggregated and further analyzed. Feedback and survey data from students and instructors provide further insights about the strengths and weaknesses of the system. We then update the system and its underlying models. Data analysis generates new hypotheses. Subsequent instant tests and snapshot evaluations determine whether the system changes had the intended effects, and test the new hypotheses. Note that while instant tests can be performed at any time, snapshot evaluations must be scheduled in coordination with instructors and are subject to the constraints of the academic calendar.

Because all learner data are archived it is also possible to test multiple system versions on the same learner data sets. We call these tests regression tests. They help ensure that the observed differences in performance are due to system changes and not differences between learner populations.
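In outline, such a regression test replays archived, human-labeled utterances through two system versions and compares their accuracy. The sketch below assumes a hypothetical record format and engine interface; it is not the actual Enskill test harness.

```python
# Sketch of a regression test over archived learner data (hypothetical interfaces).

def accuracy(engine, archived_utterances):
    """Fraction of archived utterances for which the engine reproduces the labeled intent."""
    correct = sum(1 for utt in archived_utterances
                  if engine.interpret(utt["transcript"], utt["dialogue_context"])
                  == utt["labeled_intent"])
    return correct / len(archived_utterances)

def regression_test(released_engine, candidate_engine, archived_utterances):
    """Compare the released and candidate versions on the same learner data set."""
    return {"released": accuracy(released_engine, archived_utterances),
            "candidate": accuracy(candidate_engine, archived_utterances)}
```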

Finally, D3 can make use of A/B evaluations. In A/B evaluations two different versions of the system are tested simultaneously with different sets of users (Kohavi and Longbotham 2017). Instant tests and snapshot evaluations can both in principle be conducted in A/B fashion, but since they run for a short period of time they can yield small sample sizes that make it difficult to identify statistically significant differences between A and B.

The snapshot evaluations described in this article were conducted in 2018. The first evaluation was conducted with a group of students in English for special purposes at the University of Novi Sad in Serbia. It utilized learner data collected by Enskill as well as survey data from the learners. The analyses were compared against instant tests of another set of learners in the Laureate International Network. Then in late 2018 another snapshot evaluation of the improved system was conducted with students at the University of Split in Croatia. Results were compared against instant tests with learners at the Universidad Privada del Norte in Peru. This evaluation tested the effects of previous changes that were made to the system and informed subsequent development.

The rapid evaluations in D3 necessitate tradeoffs between speed and thoroughness of analysis. Large data sets take time to accumulate, so D3 evaluations must make do with smaller data sets. High-quality analyses with human raters and transcribers take time and so can be integrated only to a limited extent. As better tools are developed, more of the analysis can be automated and moved into the instant tests, reducing the time needed to complete the snapshot evaluations.

The choice of evaluation metrics also evolves over time, as our understanding of learner behavior and system performance improves. The portrayal of D3 in this article, with similar metrics applied at each snapshot, is a simplification for the purpose of exposition. The reality is that everything is evolving—the learning system, the analysis tools, and the evaluation metrics. Metrics that were useful in the earlier snapshots have been discarded and replaced by new and better ones.

The snapshot evaluations described in this article were conducted with learner populations that are different from what Enskill English was originally designed for. It was designed for beginner learners but was tested with intermediate learners. This may seem surprising—why design a learning system for one learner population and test it on a different learner population? The reason is that it enables us to iteratively extend Enskill English to these new learner populations. This illustrates the dual roles of evaluation in D3: to evaluate the current version of the system and to collect data to inform development of future versions of the system.

The current learning levels in Enskill English take a full semester to complete. That is too long a period of time for a snapshot evaluation. So instead we look for leading indicators of learning, preliminary evidence that learning is taking place. If intermediate summative assessments are integrated into the curriculum, the results of those assessments can be used. Long-term effectiveness evaluations come later, as data from long-term use become available.

Snapshot Evaluation 1 at the University of Novi Sad

The first snapshot evaluation of Enskill English to be discussed here took place in April and May of 2018. The study population was a class of 80 students in an English for information technology purposes (EiT) program at the Faculty of Technical Sciences of the University of Novi Sad in Serbia. These students had passed an English placement test and were able to understand English at the CEFR B1 (i.e., intermediate) level. One was a native of Bosnia and Herzegovina; the rest were natives of Serbia. All were native Serbian speakers.

At the time of this evaluation the CEFR A1 level of Enskill English was in use by beginner learners and the A2 level was nearing completion. We were interested in extending Enskill to the B level, but we did not yet have data from B-level learners. We did not know whether B-level learners would find the A-level simulations useful, and where they might fall short. However, the director of the English program at the University of Novi Sad, Ms. Vesna Bulatović, indicated that students had limited opportunities to practice speaking English, so we hypothesized that the existing A1 and A2 level simulations might be beneficial for them. Ms. Bulatović cooperated with Alelo in collecting student surveys and ensuring that the students participated in the evaluation.

The evaluation objectives of the study were to test the following hypotheses:

  • Hypothesis 1. ELLs at the CEFR B level consider Enskill English to be a good way to practice English. They find it useful, fun, and easy to use.

  • Hypothesis 2. B-level ELLs benefit from practice with Enskill English A-level simulations.

As of the time of this study 1800 A-level English learners and their instructors had already used Enskill English, and the responses had been positive so far. So we expected this study to support Hypothesis 1. The status of Hypothesis 2 was much less certain, since the A-level simulations were not designed for B-level learners. The following are possible reasons why Hypothesis 2 might be true:

  • Learners can use simulations for review, to maintain mastery of language skills covered earlier in the language course.

  • They can practice vocabulary and structures common to real-world situations.

The following are some possible reasons why the study might not confirm Hypothesis 2:

  • B-level learners might regard the A-level conversations to be too easy for them.

  • B-level learners might use complex language that the A-level simulations could not handle.

The data collection and system testing objectives of the study were as follows:

  • To collect data from native Serbian speakers, in case the speech and language models need to be updated and retrained for Serbian speakers;

  • To collect data from B-level ELLs to inform future development; and

  • To test a new version of the Enskill dialogue system and compare it against the current version.

The remainder of this section gives an overview of the study design and then summarizes the survey findings. It then analyzes learner performance during the trial, as inferred from the learner data. It summarizes the characteristics of the speech data collected from the B-level Serbian speakers. Finally, it presents results from tests of the two versions of the dialogue system, starting with instant tests and then continuing with an in-depth analysis to interpret the instant test results and compare system performance. This is an illustration of the process of data-driven design research in D3.

Study Design

The study materials included two simulation modules at the A1 level and two simulation modules at the A2 level. The A1 simulations had been previously released; the A2 simulations were still beta versions that were in the process of final testing. The study lasted three weeks. During the study period the learners were directed to practice each simulation as a homework assignment until they could complete all the objectives. They were free to practice the simulations more times if they wished.

After the learners completed the trial they completed a survey of their attitudes toward Enskill English and the simulations. The survey included 5-point Likert-scale questions and free-form questions in which learners could describe what they liked about Enskill English and what they would like to see improved. The survey included a net promoter score question to determine whether they would recommend Enskill English to their family and friends.

Enskill kept track of the number of times each learner completed each simulation and whether the learner successfully completed all the task objectives. It also recorded the date and time of each learner’s first and last task completion, as well as the total time spent. We provided these statistics to the instructor so she could track the learners’ usage of the courseware.

During the trial the partial-matching NLU drove the on-screen character behavior. The substitution table included errors from speakers of several languages (Portuguese, Spanish, Thai, and Turkish), but none from Serbian speakers. A new classification-based NLU engine ran in the background on the same learner utterances. The classification-based NLU utilized Microsoft’s LUIS (Language Understanding Intelligent Service) package. Like most commercial providers of cognitive services, Microsoft does not disclose which particular machine learning method it uses in LUIS.

Survey Results

A total of 72 learners completed the survey; 71 attempted the A1 simulations and 66 attempted the A2 simulations.

The net promoter score (NPS) from the survey was 10. An NPS above zero is considered good. There were 27 promoters (score of 9 or 10), 28 passives (score of 7 or 8), and 17 detractors (score of 0 to 6). NPS is calculated as (promoters – detractors) / total * 100.

Most learners agreed with the statement “Enskill English exercises are a good way to practice speaking English” (mean = 4.03 of a possible 5, s.d. = 0.69). Only one learner disagreed, and no learner strongly disagreed. Most agreed with the statement “Enskill English is easy to use” (mean = 4.06, s.d. = 0.85). Two learners disagreed and two learners strongly disagreed. These findings support Hypothesis 1: ELLs in this study considered Enskill English to be a good way to practice English.

Most learners agreed with the statement “The interactive conversations helped me with my English speaking and listening skills” (mean = 3.49, s.d. = 1.09). 8 learners disagreed and 5 learners strongly disagreed. This is consistent with Hypothesis 2 for most learners, under the assumption that if learners felt that the simulations helped them, they probably did help them. Conclusions from such self-reports should be considered tentative, although self-assessments are common practice in second language learning. Rubrics of self-assessed can-do statements are frequently used to evaluate spoken proficiency (ACTFL 2015), because there is currently no practical alternative. Teachers do not have time to assess each student through one-on-one conversation. Objective proficiency tests require scoring by trained human raters, and do not measure spoken competencies at a granular level.

Some learners commented that the dialogue choices were too limited and restrictive. This is not surprising because the NLU pipeline was designed to understand the relatively simple language of learners at the CEFR A level. Some learners also encountered bugs in the A2 beta simulations.

Most learners agreed with the statement “The interactive conversations are amazing” (mean = 3.83, s.d. = 0.91). Nine learners disagreed, but no learner strongly disagreed. This wording was chosen to get a sense of the learners’ subjective experience of practicing with AI-driven characters.

Here are some comments from the learners about what they liked about Enskill English:

  • It is very imteresting [sic] and it can help you to aprove [sic] your communication skills.

  • It is very useful for practice.

  • It’s very easy to use and you can learn a lot.

  • The variety of the conversations.

  • There are very interesting conversation [sic] and I really liked it.

Here are some comments from the learners about what they would like to see improved:

  • Add more variety of answers and possible questions.

  • I would add more themes for conversation.

  • Find the right activity for you, improve your English writing skills, improve your English reading skills.

  • Maybe some more professional conversation.

  • I don’t know, everything is fine.

Learner Performance Analysis

Of the learners who participated in the trial, the total time spent using Enskill English varied greatly (mean = 32:31, s.d. = 28:15, min = 00:24, max = 2:41:37). The variation was due mainly to the number of times the learners practiced the simulations. 80% of the learners practiced at least one simulation multiple times. Table 1 summarizes the simulation activity of samples of 16 learners on the CEFR A1 simulations and 16 learners on the CEFR A2 simulations. This data set includes 108 simulation trials, 54 of the CEFR A1 simulations and 54 of the CEFR A2 simulations. In 18 A1 trials and 18 A2 trials a learner tried a simulation just once; in the remaining cases the learners tried simulations multiple times.

Table 1 Summary of simulation activity for a sample of learners

Tables 2 and 3 show the performance of these learners in the simulations, and indicate how it improved with repeated practice. Completions are conversational tasks which the learners were able to complete, and full completions are completions in which the learners met all of the objectives. For the simulations that learners tried multiple times, the tables show performance on the first trial and performance on the last trial. One simulation run, in which the learner paused for 18 min at the beginning and did not continue, was excluded from the analysis. Elapsed time to complete a simulation decreased from an average of 7:01 to 4:12. When learners tried simulations multiple times they completed 66.67% of them on the first trial and 100% of them on the last trial. They completed the simulations with a full score 47.62% of the time on the first trial and 61.90% of the time on the last trial. Thus the learners who practiced the simulations multiple times appeared to be more concerned with completing the conversations than with achieving all of the objectives within the conversations.

Table 2 Overall learner performance over multiple simulation trials
Table 3 Learner performance in conversational exchanges over multiple simulation trials

Table 3 shows change in conversational performance in more detail. Here repeats are conversational exchanges in which either the learner or the on-screen character did not understand what the other party said, as indicated by responses such as “Sorry, I didn’t understand” or “Please repeat what you said.” Meaningful exchanges are exchanges which are not repeats. The average number of conversational exchanges per minute increased from 2.62 to 4.31. The number of meaningful exchanges increased and the number of repeats decreased, resulting in an increase in meaningful exchange rate from 62.65% to 82.35%.

These analytics can be interpreted as follows.

  • The increase in meaningful exchange rate suggests that the learners’ listening comprehension may be improving and/or the accuracy of the learner’s speech production may be increasing (since the on-screen character is able to understand what the learner said more consistently.)

  • The reduction in time per exchange suggests that learners’ listening comprehension may be improving and/or the cognitive fluency of the learners’ speech production may be increasing (learners may be spending less time reviewing hints and deciding what to say, and may be speaking with less hesitation).

These hypothesized explanations must be further tested; the improvements may be due in part to growing familiarity with the software itself rather than to gains in language skill. But if true they are significant, since fluency and accuracy are key indicators of cognitive fluency in a second language (Segalowitz 2010). They may be good leading indicators of learning, for use in feedback to students and instructors. Together with the self-reports that the simulations helped the learners improve their English speaking and listening skills, these indications of proficiency improvement are encouraging.
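For concreteness, the per-trial analytics reported in Table 3 can be computed along the following lines. This is a simplified sketch with a hypothetical log format, not the actual Enskill analytics pipeline.

```python
# Simplified sketch of the per-trial conversation analytics (hypothetical log format).

def trial_analytics(exchanges, elapsed_minutes):
    """Exchange rate and meaningful exchange rate for one simulation trial.

    Each exchange is marked repeat=True when either party failed to understand
    the other (e.g., "Sorry, I didn't understand").
    """
    total = len(exchanges)
    repeats = sum(1 for e in exchanges if e["repeat"])
    meaningful = total - repeats
    return {
        "exchanges_per_minute": total / elapsed_minutes if elapsed_minutes else 0.0,
        "meaningful_exchange_rate": meaningful / total if total else 0.0,
    }

# Example: 14 exchanges, 3 of them repeats, in a 4.2-minute trial.
log = [{"repeat": i < 3} for i in range(14)]
print(trial_analytics(log, 4.2))  # ~3.3 exchanges/min, ~0.79 meaningful exchange rate
```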

Data Collection Results

Review of the learner speech recordings collected during the study confirmed that the learners had an intermediate level of spoken proficiency. The learners produced a wide variety of utterances, which suggested that they were not relying on memorized phrases. For example, in response to the question “What day would you like to depart?” the learners produced a total of 184 different responses. The utterances used a range of modal verb phrases (e.g., “I want to”, “I have to”, “I would like to”), verbs (“leave”, “depart”, “travel”, “take a trip”), and date phrases (e.g., “May 1st”, “May the first”, “first May”, “the first of May”, “on Sunday”). Some utterances were simple phrases (e.g., “on May 1st”) and others were complex sentences (e.g., “I would like to go at 1st of May and return at 7th of May”).

The learners’ speech was reasonably fluent but exhibited some disfluencies, such as sentence restarts. For the most part the learners’ pronunciation was fairly good, but they frequently mispronounced certain words, such as the word “May” which was often mispronounced as /mai/. Some learners had difficulty with voiced consonants at the ends of words, mispronouncing “of” as “off” or “bag” as “back”. Many of the utterances in the data set had grammatical errors, particularly in the use of prepositions and articles (e.g., “at May 1st”, “in the first May”, “at the Sunday”).

Instant Test Results

During the evaluation period a series of weekly instant tests was performed. The following analysis aggregates and summarizes the results of the instant tests during the evaluation period.

Instant tests must employ analytics that can be calculated on demand. We have evaluated various analytics and settled on raw understanding rate as a good metric to use. Raw understanding rate is the percentage of dialogue exchanges in which the learner spoke an utterance, the system recognized the learner’s intent, and replied. That and the false positive rate are important indicators of the user’s experience with the dialogue system. If the false positive rate is high, causing the animated characters to respond inappropriately in the dialogue, users quickly lose confidence in the reliability of the system. If the raw understanding rate is low users will frequently have to repeat or recast what they say in order to be understood, leading to frustration.
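A sketch of how an instant test might compute raw understanding rate per simulation, in the spirit of Table 4, is shown below; the record fields are hypothetical stand-ins for the actual Enskill data.

```python
# Sketch of an instant-test metric: raw understanding rate per simulation
# (hypothetical record format, for illustration only).

from collections import defaultdict

def raw_understanding_rates(exchange_records):
    """Percent of exchanges per simulation in which the system recognized the
    learner's intent and replied (understood=True)."""
    totals = defaultdict(int)
    understood = defaultdict(int)
    for rec in exchange_records:
        totals[rec["simulation"]] += 1
        understood[rec["simulation"]] += rec["understood"]
    return {sim: 100.0 * understood[sim] / totals[sim] for sim in totals}
```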

Table 4 shows the raw understanding rates for the learners in the Novi Sad trial. The raw understanding rates vary considerably between simulations, ranging from 63% to 85%.

Table 4 Raw understanding rates for the Novi Sad learners

Table 5 shows the raw understanding rates in the A1-level simulations from instant tests of A-level ELLs studying at Laureate International Network institutions in Chile, Costa Rica, Honduras, Mexico, Peru, Portugal, and Spain from May 28, 2018 until August 6, 2018. The raw understanding rates are 72% to 73%, comparable to raw understanding rates for the Novi Sad data. This suggests that the system performance with the Novi Sad students is not particularly unusual, but is similar to the average performance for a broad population of A-level learners.

Table 5 Raw understanding rates for Laureate International Network students

Table 6 shows the success rates for spoken practice exercises, i.e., the percentage of learner utterances that the system accepted as correct. Results are for utterance formation exercises and response formation exercises. In utterance formation exercises learners are prompted to produce an utterance that conveys a particular meaning or intent; in the response formation exercises the on-screen character speaks to the learner and the learner replies. In these exercises learners must produce well-formed utterances; communicating intent alone is not sufficient. The exercises have a pre-authored set of possible responses (correct and incorrect), and to succeed the learner’s response must match one of the correct responses exactly. Exercises are grouped by level; there is one group that accompanies the CEFR A1 dialogues and another group that accompanies the CEFR A2 dialogues.

Table 6 Success rates for spoken practice exercises

The learners completed a total of 130 spoken-language practice exercises. The average success rate for these exercises was 52%. Since these are A-level exercises and the learners were B-level learners, the learners should have found the exercises to be easy and the success rates should have been higher. Analysis of the utterances that were rejected indicated that 5% had very poor sound quality, 8% were grammatically incorrect, 14% were grammatically correct but were still rejected as incorrect, 33% failed to conform to the exercise instructions (e.g., they were incomplete sentences when the learner was instructed to say a complete sentence), and 38% were rejected due to errors in the ASR transcription. Grammatically incorrect utterances sometimes also had transcription errors: the speech recognizer language model was trained on grammatically correct speech and is biased toward grammatically correct utterances, so it sometimes mistranscribes grammatically incorrect utterances as grammatically correct ones. Also, if a learner speaks hesitantly, with long pauses between words, the speech recognizer sometimes stops transcribing at one of the pauses. The sample of completions is small, especially for the A2 exercises, so more data are needed in subsequent evaluations to confirm these observations for the A2 exercises.
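The exercise scoring described above can be sketched roughly as follows. The normalization and function names are illustrative assumptions; the production matcher is not shown.

```python
# Rough sketch of spoken-exercise scoring: the ASR transcript must match one of the
# pre-authored correct responses exactly (after simple text normalization).

import string

def normalize(text):
    """Lowercase and strip punctuation so surface formatting does not affect matching."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def score_exercise(asr_transcript, correct_responses):
    """Return True if the learner's transcribed utterance exactly matches a correct response."""
    return normalize(asr_transcript) in {normalize(r) for r in correct_responses}

correct = ["What time will the train arrive?", "When will the train arrive?"]
print(score_exercise("when will the train arrive", correct))   # True
print(score_exercise("when the train will arrive", correct))   # False: word order error
```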

In-Depth Analysis of Dialogue-System Performance

The NLU engine in the Enskill dialogue system interprets each learner utterance by mapping it to a semantic category known as an intent. We tested two NLU engines: the released engine running in the foreground and driving the interaction with the learner, and a new engine running in the background on the same learner data. Although it is common for developers to test algorithms offline on archival data, that is difficult in the Enskill case because learner responses that are appropriate in one version of the dialogue system may not make sense in the context of a different version. Testing two versions on the same data stream makes it easy to compare versions.

The released NLU engine used the partial-matching algorithm described above, calculating the distance between the learner’s utterance and a library of utterances, each of which is associated with one or more intents. The distance metric counts the number of word-level insertions, deletions, and substitutions, giving less weight to substitutions that are known misrecognitions.

The new classification-based NLU engine employed a set of text classifiers, one for each intent, each trained on examples of utterances that express that intent. We expected that it would be particularly useful for intermediate-level learners, who are able to construct new utterances that are quite far from known utterances in terms of insertions, deletions, and substitutions. The classifier approach is tolerant of grammatical errors, which is important when processing learner language.

The classification-based NLU evaluates each classifier and selects the intent with the highest score if that intent is acceptable in the current dialogue context; otherwise it rejects the utterance. A dialogue graph determines which intents are acceptable at each point in the dialogue. Table 7 shows a breakdown of intents by simulation. The number of intents ranged from 46 for Helping Owen Plan a Party (asking and answering questions about a planned party) to 115 for Jerry’s Spaghetti (ordering a meal at an Italian restaurant). In Helping Owen Plan a Party between 1 and 3 intents were considered acceptable at any one time, while in Jerry’s Spaghetti anywhere between 2 and 19 intents were considered acceptable. In general the A2 simulations have more intents and more acceptable options.

Table 7 Classification-based NLU intent statistics
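One plausible reading of the selection rule described above is sketched below: take the top-scoring intent classifier and accept it only if the dialogue graph allows that intent at this point in the conversation and the score clears a confidence threshold. The intent names, scores, and threshold are hypothetical, and LUIS internals are not modeled.

```python
# Sketch of classification-based intent selection constrained by the dialogue graph.

def select_intent(classifier_scores, acceptable_intents, threshold=0.5):
    """Return the recognized intent, or None to signal rejection."""
    best = max(classifier_scores, key=classifier_scores.get)
    if classifier_scores[best] >= threshold and best in acceptable_intents:
        return best
    return None  # rejection: the character asks the learner to repeat or rephrase

scores = {"give_departure_date": 0.82, "greet": 0.10, "order_drink": 0.64}
acceptable = {"give_departure_date", "give_destination"}  # from the dialogue graph
print(select_intent(scores, acceptable))  # -> "give_departure_date"
```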

Further analysis of dialogue responses, involving annotations by human raters, was performed to measure dialogue system performance and interpret the observed raw understanding rate. Expressed in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), achieving high precision [TP / (TP + FP)] is extremely important for Enskill. High recall [TP / (TP + FN)] is also very desirable, but not quite as critical as high precision. It is also important to measure the frequency of true negatives (TN) in the sample, i.e., the incidence of unintelligible utterances due to learner language errors or recording problems. This imposes an upper limit on the ability of the NLU to understand learner speech, which varies depending upon the level of proficiency of the learners in the sample. Native speakers frequently have trouble understanding what language learners at the CEFR A level are trying to say and must request repetition; learners at the CEFR B level frequently engage in repair and reformulation to make themselves understood (St. Giles International 2018).

Table 8 shows an analysis of the performance of the two NLU pipelines on a sample of 108 learner utterances from the A2 simulations. A human analyst assessed the accuracy of the intents selected by each NLU engine by comparing the intents to the automatically generated transcription of the utterance, and reviewing the speech recordings of the utterances in cases where the transcription was unclear or appeared to have errors. Intent recognitions were labeled as correct (true positive) when the analyst agreed with the engine’s recognized intent, and incorrect if the analyst disagreed with the engine’s recognized intent. If the NLU engine rejected the utterance, it was labeled as true negative if the utterance was unintelligible or would require clarification, and as false negative if the analyst could understand the intent of the utterance. For example, a learner responded to the question “Would you like something to drink?” by saying “Glass”. Both NLU engines rejected this, and it was labeled as true negative because it is not clear what the learner is asking for. Another learner said “I’d like lemon soda”, and this was labeled as false negative since it is clear what the learner is asking for. If the NLU selected an intent but the analyst considered the utterance unintelligible, this was categorized as false positive. Note that although this method has the advantage that it can be performed quickly, it is potentially biased toward the system’s intents.
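Given per-utterance labels assigned in this way, the aggregate measures follow directly. The sketch below uses illustrative label strings; it is not the analysis code used in the study.

```python
# Precision, recall, and true-negative share from per-utterance labels
# ("TP", "FP", "TN", "FN") assigned by the human analyst.

from collections import Counter

def nlu_metrics(labels):
    c = Counter(labels)
    tp, fp, tn, fn = c["TP"], c["FP"], c["TN"], c["FN"]
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,    # TP / (TP + FP)
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,       # TP / (TP + FN)
        "true_negative_share": tn / total if total else 0.0,  # unintelligible utterances
    }
```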

Table 8 Intent recognition by the two NLU engines on a sample of Novi Sad utterances

The analysis showed that the partial-matching NLU had slightly more true positives (74% vs. 72%). It failed to interpret 26% of the utterances. Out of these, the human rater judged 6% as true negatives. This left 19% of the utterances which the human rater judged as false negatives. Precision was very high, 99%.

The classification-based NLU had a similar percentage of true negatives, 6%. It had slightly more false positives, 2%. Many of these false positives appear to be due to incorrectly trained classifiers. For example the classification-based NLU failed to detect a number of greetings, which are simple phrases and should have been easily recognized. 20% were judged as false negatives. The partial-matching NLU performed better on 12% of the utterances, and the classification-based NLU performed better on 12% of the utterances.

Although the partial-matching NLU engine performed slightly better on this data set, the classification-based NLU appeared more promising, provided that the classifiers are retrained to correct obvious classification errors. This is a hypothesis that would need to be tested in future iterative evaluations, after more training has been performed. We concluded that the best option would be to combine the two methods, since each engine was able to recognize intents that the other engine missed.

To better understand how the classification-based NLU performed, a sample of 457 learner utterances from A1 simulations was annotated by human raters. The annotators listened to the utterances, transcribed them, and then assigned intent categories to them. The annotators were provided with the list of acceptable intents extracted from the dialogue graph. When the annotator assigned an intent to an utterance, the utterance was judged as “in domain”. If the annotator judged that the intent of the utterance was not covered by the list of acceptable intents, the utterance was judged as “out of domain”. If the intent of the utterance could not be determined because it was unintelligible or incomplete, it was labeled as “unintelligible.” If the system’s intent matched the annotator’s intent the utterance was classified as “correct”. If the system’s intent was different from the annotator’s intent the utterance was classified as “error”. If the intent matched but it was not one of the acceptable intents the utterance was judged “out of order”.

Table 9 summarizes the results. 87.3% of the NLU intent assignments agreed with the human annotators. The NLU declined to assign intents to the 1.2% of utterances that were out of domain, as it was supposed to do. The “error” cases are further subdivided into “error: not recognized” (the NLU failed to generate an intent) and “error: misrecognized” (the NLU generated an intent which was different from the annotators’ choice). The misrecognition rate (i.e., false positives) was quite low, under 1%.

Table 9 Comparison of NLU output with intent annotations by human annotators

The word error rate for the automated speech recognizer (ASR) on a sample of 156 utterances in this data set was 9.6%. To arrive at this error rate, a human rater transcribed each utterance; the transcriptions were compared with the ASR-generated transcriptions and edit distances at the word level were calculated. Factors that contributed to word error rate included pronunciation errors, grammar errors (because the speech recognizer was biased toward grammatically correct native speech), and pauses (because the ASR stopped transcribing when it encountered a long pause).
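This follows the standard word error rate definition: a word-level edit distance between the human reference transcript and the ASR output, divided by the number of words in the reference. A minimal sketch, using the mistranscribed utterance from Fig. 4 with an assumed reference transcript:

```python
# Word error rate: word-level edit distance between the human reference transcript
# and the ASR output, divided by the number of reference words.

def word_error_rate(reference, hypothesis):
    r, h = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(h) + 1)]
          for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                           # deletion
                           dp[i][j - 1] + 1,                           # insertion
                           dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution / match
    return dp[len(r)][len(h)] / len(r) if r else 0.0

print(word_error_rate("I want to leave at first May",   # assumed human transcript
                      "I want to leave at first my"))   # ASR output -> 1/7 ≈ 0.14
```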

Snapshot Evaluation 2 at the University of Split

Based on these evaluations we released a new version of the Enskill English CEFR A2 simulations with a new NLU pipeline that employs a combination of the partial-matching NLU and the classification-based NLU. The Universidad Privada del Norte (UPN) in Peru started using it in its English classes. We also tested the CEFR A1 simulations with the new pipeline, but had not yet released them.

Another snapshot evaluation was performed in December 2018 with CEFR B1 learners at the University of Split in Croatia. 39 learners participated in this evaluation. Data from UPN students underwent instant tests for comparison. This snapshot evaluation was similar to the earlier one, but addressed additional research questions posed by the University of Split Faculty of Science. The main differences from the Novi Sad snapshot evaluation were as follows.

  • The evaluation compared two versions of the curriculum, A and B. The A version was the same as the curriculum in the Novi Sad evaluation. In the B version learners were directed to complete the interactive language exercises first before attempting the dialogue simulations. This was intended to test whether prior practice results in better performance in the simulations.

  • It used the new NLU pipeline. No alternative NLU pipeline was included in the test.

  • It included one additional CEFR A1 simulation (School Newspaper Interviews), in which learners practice asking other students about their activities and interests.

  • The survey included additional questions to gain insight into the strengths and weaknesses of Enskill English, to determine where further improvements were needed.

General Observations Regarding the Collected Data

The speech recordings and survey responses indicated that the learners could express themselves fairly well in written English. They could communicate in spoken English with fewer pronunciation errors than the Novi Sad group. For example, the word "May" was pronounced correctly by the learners in this sample. Other pronunciation errors were similar to those of the Novi Sad learners.

The word error rate for this data set was estimated to be 5.4%. This estimate was obtained by taking a sample of 100 utterances and analyzing them using the same method that was employed on the Novi Sad data. The lower error rate suggests that the Split learners' speech was, overall, closer to that of native speakers. The number of unintelligible utterances was also lower.

The Split students made some grammar mistakes. Common mistakes included omission of definite articles, incorrect use of pronouns, omission of "to" with infinitive verbs, and incorrect word order in questions. Overall these observations suggest that the University of Split learners were somewhat higher in the CEFR B range than the University of Novi Sad learners, and thus somewhat farther from the target learners for the A-level Enskill English simulations. These are, of course, impressions from the data, not definitive assessments from psychometrically validated instruments.

Survey Results

The net promoter score from the survey was 5. This score is considered good but is a somewhat less positive result than was obtained from the Novi Sad students.
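
For reference, the net promoter score is conventionally derived from a 0-10 "would you recommend" question: the percentage of promoters (ratings of 9-10) minus the percentage of detractors (ratings of 0-6), yielding a value between -100 and 100. A minimal sketch, assuming the survey followed this standard convention; the ratings shown are illustrative, not the study data:

  def net_promoter_score(ratings):
      """NPS = % promoters (9-10) minus % detractors (0-6), rounded."""
      promoters = sum(1 for r in ratings if r >= 9)
      detractors = sum(1 for r in ratings if r <= 6)
      return round(100 * (promoters - detractors) / len(ratings))

  print(net_promoter_score([10, 9, 9, 8, 7, 6, 5, 10]))  # -> 25 (illustrative data)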

Most learners in this group agreed with the statement "Enskill English exercises are a good way to practice speaking English" (mean = 3.92 out of a possible 5, s.d. = 0.73). Two learners disagreed, and no learner strongly disagreed. Most learners agreed with the statement "Enskill English is easy to use" (mean = 4.33, s.d. = 0.74). One learner disagreed and no learners strongly disagreed. Together these findings are consistent with Hypothesis 1: ELLs in this study considered Enskill English to be a good way to practice English.

The learners in this group were neutral toward the statement "The interactive conversations helped me with my English speaking and listening skills" (mean = 3.18, s.d. = 1.19). Nine learners disagreed and three strongly disagreed. This rating is somewhat lower than that from the Novi Sad students, perhaps because the Split students had a somewhat higher level of proficiency.

Most learners agreed with the statement “The interactive conversations are amazing” (mean = 3.51, s.d. = 0.91). Four learners disagreed, and one learner strongly disagreed.

Overall, many learners had positive things to say about the Enskill learning experience. The responses to the question “What did you like about Enskill English?” fell into the following categories: conversations with virtual characters (10), easy to use, fun (8), realistic conversations (7), it is a good way to learn (5), it helps learn vocabulary (3), interactive exercises (2), it is interesting, innovative (2), and it is good for beginners (1).

Responses to the question “If you could change, add, or remove anything, what would it be?” fell into the following categories: add more conversations (7), add more content (5), more challenging exercises (4), improved conversations (3), higher levels of English (2), more challenging exercises (2), better mix of exercises (2), less challenging exercises (1), more challenging conversations (1), improved user interface (1), improved speech recognition (1), technical improvements (1), a final test or exam (1), and no changes (5). Responses to the question “What do you find frustrating?” fell into the following categories: dialogues accept a limited range of inputs (6), microphone problems (6), exercises accept a limited range of inputs (5), issues with speech recognition and transcription (4), content was too easy (4), content was repetitive (3), network problems (3), system was slow (2), limited range of topics (1), and no issues (3).

Instant Test with Instructors

Table 10 shows raw understanding rates for instructors at the University of Split who were reviewing Enskill English. None of them were native English speakers. The raw understanding rates in the dialogues were quite high, approaching 100%. Based on this experience, the University of Split instructors decided to proceed with the learner trial.

Table 10 Raw understanding rates for the first University of Split users

Learner Performance Analysis

The A study group had a slightly higher raw understanding rate than the B group (85% vs. 82%). However, only four students in the B group completed the practice exercises before attempting the A1-level simulations, and only three students in the B group completed the practice exercises before attempting the A2-level simulations. None of the students in the A group completed the practice exercises first. Thus there were insufficient data points to perform a statistically meaningful A/B evaluation, and the two groups are combined in the following analyses.

As shown in Table 11, raw understanding rates varied by simulation, from 63% to 94%. Overall, the average raw understanding rate across the data set was 84%, compared with 71% for the Novi Sad data set. The updated NLU appears to have improved NLU performance, although the difference may be due in part to differences in language proficiency between the two groups. The added simulation (School Newspaper Interviews) had a very high raw understanding rate, but the average understanding rate improved even when that simulation is excluded. Overall, the classification-based NLU's interpretation was used for 35% of the utterances, the partial-match NLU's for 50%, and 15% of the utterances did not receive any interpretation.

Table 11 Raw understanding rates for the University of Split group
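
Per-simulation raw understanding rates and the share of utterances handled by each engine can be computed directly from utterance logs. The sketch below assumes one log record per learner utterance with hypothetical field names ("simulation" and "engine", the latter being None when no interpretation was produced); it illustrates the calculation, not the actual Enskill analytics code.

  from collections import defaultdict

  def understanding_stats(log_records):
      """Return (raw understanding rate per simulation, engine usage shares)."""
      per_sim = defaultdict(lambda: {"total": 0, "understood": 0})
      engine_counts = defaultdict(int)
      for rec in log_records:
          engine = rec.get("engine")       # "classifier", "partial_match", or None
          per_sim[rec["simulation"]]["total"] += 1
          if engine is not None:
              per_sim[rec["simulation"]]["understood"] += 1
          engine_counts[engine] += 1
      rates = {sim: c["understood"] / c["total"] for sim, c in per_sim.items()}
      total = sum(engine_counts.values())
      shares = {str(engine): n / total for engine, n in engine_counts.items()}
      return rates, shares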

Table 12 summarizes the simulation activity of the first 15 learners using the CEFR A1 simulations and the first 16 learners using the CEFR A2 simulations. This data set includes 67 trials of CEFR A1 simulations and 66 trials of CEFR A2 simulations. The learners attempted most simulations only once. Some simulations were attempted multiple times; one student attempted a CEFR A1 simulation seven times, and another attempted a CEFR A2 simulation seven times.

Table 12 Summary of simulation activity for a sample of learners

Table 13 shows the overall performance of these learners in the simulations. The average completion rates are similar to those of the Novi Sad sample. The average elapsed time to complete a simulation decreased from 4:24 to 3:44 (minutes:seconds).

Table 13 Improvements in performance due to practice

Table 14 shows a more detailed analysis of performance in the dialogues. The number of repeats per simulation in the Split group was lower than in the comparable Novi Sad group (2.79 vs. 4.56). The number of conversational exchanges per minute for the Split group increased from 3.37 on the first trial to 4.26 on the last trial, and the meaningful exchange rate increased from 77.78% to 88.78%. Inspection of the data revealed that some learners got stuck and repeated the same intent multiple times, driving up repeat rates. This suggests that improvements to the dialogue system are needed to help learners get unstuck.

Table 14 Improvements in conversational exchanges due to practice
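
Analytics such as exchanges per minute, meaningful exchange rate, and repeats can be derived from dialogue transcripts. The sketch below assumes each trial is a list of (timestamp_in_seconds, intent) pairs, with intent None when the utterance was not understood, and treats consecutive identical intents as the "stuck" repeat pattern noted above. The exact definitions used in Table 14 may differ; this illustrates the general approach only.

  def trial_analytics(exchanges):
      """Compute simple per-trial dialogue analytics from (time, intent) pairs."""
      total = len(exchanges)
      # Assume at least two exchanges so the elapsed time is nonzero.
      duration_min = max((exchanges[-1][0] - exchanges[0][0]) / 60.0, 1e-6)
      understood = sum(1 for _, intent in exchanges if intent is not None)
      repeats = sum(1 for prev, cur in zip(exchanges, exchanges[1:])
                    if prev[1] is not None and prev[1] == cur[1])
      return {
          "exchanges_per_minute": total / duration_min,
          "meaningful_exchange_rate": understood / total,
          "repeats": repeats,
      }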

Table 15 shows the success rates for the spoken language practice exercises. The sample size was much larger than for the Novi Sad study group, so it is more likely to provide an accurate estimate of success rates. The success rates are similar to those found with the Novi Sad group.

Table 15 Success rates for spoken language practice exercises

NLU Test Results

Table 16 shows key statistics for the NLU models used in this evaluation. The NLU models had been simplified, removing overlapping intents. The table also shows the size (number of training utterances) of each model as of March 2019.

Table 16 Classification-based NLU model statistics

Table 17 summarizes the results from transcribing and annotating 327 learner utterances from the A1 simulations. Of the NLU intent assignments, 86.0% agreed with the human annotators and were judged correct. The false positive rate remains very low (0.6%).

Table 17 Comparison of NLU output with intent annotations by human annotators

Table 18 shows raw understanding rates for a comparable set of A-level learners at the Universidad Privada del Norte, from the period between October 15, 2018 and December 10, 2018. The understanding rates for these learners are similar to those for the University of Split students. These learners worked only with the simulations and did not use the language practice exercises.

Table 18 Raw understanding rates for UPN users

Discussion

The data and surveys collected during these snapshot evaluations show that Enskill English is making progress as an engaging, useful learning tool. Most learners felt that it was a good way to learn English, and found it fun and easy to use. The changes to the NLU that were tested in snapshot 1 appear to have had some positive effect on overall system performance in snapshot 2. Raw understanding rates and meaningful exchange rates in dialogues averaged above 80% and increasingly approached the 90% range. However, precision and recall rates for the annotated samples from the two data sets were comparable; precision rates are very high, while recall rates have room to improve.

These snapshot evaluations deliberately tested Enskill English with learners whose proficiency was higher than the current simulations are designed for. Nevertheless, the system performed as well with CEFR B-level learners as it did with A-level learners. Many learners wanted to see it expanded and extended to higher levels of language proficiency. The findings from these studies will inform system improvements that make it possible to use Enskill English in CEFR B-level curricula, and will likely improve system performance in A-level curricula.

Data from the University of Split showed that the dialogues performed well with the instructors. This is critical for adoption. Here are some comments from the program directors:

  • Enskill is a great education tool that helps students practice speaking in real-life situations.

  • I really like the idea of teaching soft skills through simulations that would be also useful in teaching languages for specific purposes (Business English for example).

Analysis of the data collected in these snapshot studies yields analytics indicating that learner performance improves with practice. Analytics like these now need to be provided to learners, instructors, and institutions. Learner performance also needs to be analyzed in further detail to understand how and why it is improving.

Some of the observations in this article are based on comparisons between different system versions used by different populations. We are currently improving the data management infrastructure in Enskill to make it easier to test new system versions on archival data.

Upcoming Iterations and Other Future Work

We are currently incorporating analytics such as exchanges per minute and repeat rates into the feedback that learners and instructors see. We continue to improve Enskill's natural language understanding pipeline, both by retraining NLU models on new utterances and by using the learner language error data we have collected in new ways. This sets the stage for the next series of evaluations, one of which is now underway in May 2019 at a school in Sweden.

The Swedish trial is taking place over a series of weeks with students aged 13 to 14, in the upper level of a primary school (grundskola in the Swedish system). Students are using a new version of Enskill that includes a dashboard that reports time spent practicing each simulation, objectives completed, and performance (progress toward mastery). This version of Enskill also includes some improved NLU models that have been trained on additional learner data. We are investigating the following research questions.

  • How does Enskill English work with younger students? This evaluation is an opportunity to collect data from younger learners and see how well the Enskill English design works with these learners. Experience with middle-school students using earlier Alelo products suggests that Enskill English will perform well with these students.

  • Will learners continue to practice simulations until they have completed all objectives and can perform the simulation at a mastery level? In the snapshot evaluations described above many learners practiced simulations just once and failed to complete all of the objectives. We hypothesize that use of the analytics dashboard will motivate learners to continue to practice and improve.

  • How does learner performance improve with practice? As learners practice multiple simulations and become familiar with the software, we predict that performance improvement will be a result of increasing mastery of the conversational tasks. We will look in more detail at analytics such as repeat rates and see how they change over time.

  • How does practice with Enskill English affect learner self-confidence and ability to communicate in English? Instructors using Enskill English in Latin America have reported that it makes their classes more communicative and increases learners' self-confidence (Alelo 2019). Also, since learners will be using Enskill English on multiple tasks over multiple weeks, we hope to see progressive improvement in communication skills. We will interview the teacher supervising the students participating in the trial to examine what improvements she observes.

In future iterations we plan to make improved analytics available to instructors, so that they are able to better focus their instruction. For example, instructors have asked for analytics about the patterns of errors that their students are making.

We would like to further improve recall for the Enskill English NLU without sacrificing precision and improve overall raw understanding rates. Better dialogue repair strategies could reduce the number of repeats and increase understanding rates. We plan to look more closely at the repeat rates. There are two kinds of repeats: repeats requested by the learner and repeats requested by the on-screen character. Each should be considered separately.

Currently Enskill English simulations are used as practice and formative assessment tools; with some modification they could also be used as summative assessment tools. This could be accomplished by turning off the on-screen transcripts and other scaffolding, and changing the on-screen character's dialogue so that it is consistent with the target can-do statements but different from what the learner practiced. Then, at the end of the course, we can examine how much learning has improved over the entire period of study with simulations.

Some of the students at the University of Split reported bandwidth problems and slow load times. We have developed a low-bandwidth version of Enskill that uses still images instead of animations, and we are evaluating the possibility of shifting to that version to make Enskill more available to students who lack a high-bandwidth Internet connection.

Data annotation is important for accurately measuring dialogue system performance. It is currently performed periodically as part of snapshot evaluations. With better annotation management tools it may be possible to perform annotation earlier in the evaluation process, as immediate follow-up on instant tests. Meanwhile, as we annotate and analyze more learner data, we can use the findings to calculate better analytics from the instant tests, including estimates of NLU precision, recall, and true negative rates.
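
Once a sample has been annotated, these measures can be estimated in the conventional way: precision as the fraction of assigned intents that match the annotator, recall as the fraction of in-domain utterances that received the correct intent, and true negative rate as the fraction of out-of-domain utterances that were correctly rejected. The sketch below is one reasonable reading of these measures, not the exact formulas behind the figures reported earlier.

  def nlu_metrics(samples):
      """samples: (system_intent, annotator_intent, in_domain) tuples,
      where intents are strings or None when no intent was assigned."""
      assigned = [s for s in samples if s[0] is not None]
      correct = [s for s in assigned if s[0] == s[1]]
      in_domain = [s for s in samples if s[2]]
      out_domain = [s for s in samples if not s[2]]
      rejected = [s for s in out_domain if s[0] is None]
      precision = len(correct) / max(len(assigned), 1)
      recall = len([s for s in correct if s[2]]) / max(len(in_domain), 1)
      true_negative_rate = len(rejected) / max(len(out_domain), 1)
      return precision, recall, true_negative_rate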

The language practice exercises will be improved and made appropriate for the proficiency level of the learners. Exercises that call for single-word responses will be avoided, since word error rates tend to be higher with single-word utterances. The NLU could be used to detect near matches to correct responses and provide feedback. Finally, after a certain number of attempts the system should simply display a correct answer so that learners do not get stuck attempting the same exercise.
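
The exercise flow described above might look like the following sketch. The near_match scoring function, the 0.9 threshold, and the three-attempt limit are illustrative assumptions, not existing Enskill behavior.

  MAX_ATTEMPTS = 3      # assumed limit before revealing the answer

  def run_exercise(prompt, correct_answer, get_learner_response, near_match):
      """Ask for a response until it is close enough to the correct answer,
      then reveal the answer after MAX_ATTEMPTS tries so the learner can move on."""
      for attempt in range(1, MAX_ATTEMPTS + 1):
          response = get_learner_response(prompt)
          if near_match(response, correct_answer) >= 0.9:   # assumed threshold
              return "correct"
          print("Not quite -- try again.")   # targeted NLU feedback would go here
      print("The correct answer is: " + correct_answer)
      return "revealed"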

The data collected in these snapshot evaluations have provided us with insights into how to extend Enskill English to support language at the CEFR B level. This will be a major focus for the future development of Enskill. For example, Enskill dialogues are currently deterministic: if the learner says the same thing each time, the on-screen character will respond in the same way. In the future, Enskill dialogues will be nondeterministic and may randomly introduce complications that the learner must respond to. This demands a higher level of language proficiency and should result in more robust learning and a greater ability to transfer to real-world situations.

As Enskill is extended to higher levels of language proficiency it can be applied to a wider range of learning problems. Many communication tasks in industries such as healthcare and hospitality, and in disciplines such as sales, customer service, and management, can be modeled as conversational tasks at the CEFR B level. As Enskill is applied to increasingly challenging tasks we expect that it will prove useful as a refresher training tool and as a tool for providing just-in-time training.

Recommendations for D3

The following are some lessons learned from these evaluations and recommendations for D3 as a development method and as a form of educational design research.

The ability to test different versions of system modules on different learner data sets is essential. Tools for retrieving, processing, and analyzing data sets are therefore helpful. However, testing highly interactive systems such as Enskill on archival data can be a challenge. Fortunately there are alternative testing techniques that are easier to implement, such as testing new versions in the background as in the Novi Sad evaluation.

Analysis of learner data offers insights, but should be complemented with student survey data, instructor feedback, and summative assessments. Developers of self-paced, self-study online products can embed surveys into the product. But for blended learning solutions, and other supplements to classroom instruction, input from instructors is essential, and this requires coordination with academic schedules and the cooperation of educational institutions. We are indebted to the program directors and instructors at the University of Novi Sad and the University of Split for their cooperation in the studies presented in this article.

Data collection requires scrupulous adherence to data privacy and data protection regulations such as GDPR (General Data Protection Regulation). Institutions and users must grant permission to collect data, data must be anonymized, and users must have the option of removing their data from the system if they choose. No learners participating in the studies presented here requested that their data be removed, so we were able to analyze data from all of the study participants.

D3 research is akin to iterative educational design research (McKenney and Reeves 2018), which applies to real learning settings and so can have real-world impact. But until the research paradigm is more broadly understood, potential issues can arise. The following are a few points to bear in mind.

It is commonplace in a D3 development process to defer research questions to future iterations, because insufficient data are available, because other research questions have higher priority, or simply due to lack of time. Fortunately, it is not necessary to answer all research questions within one D3 evaluation. There will be more opportunities to investigate outstanding research questions in future evaluations.

It is useful throughout the process to inspect learner data to gain further insight into the causes of the observed behaviors and develop hypotheses for further iterative evaluation. Cloud-based data archival makes data available for inspection at any time.

As the learning system evolves through the D3 process it is often desirable to go back and reanalyze data from earlier iterations, and compare findings from succeeding iterations. For example as we analyzed the data from the University of Split we went back and reanalyzed the data from the University of Novi Sad for comparison. This is another reason why it is very important to archive data in the cloud, and have analysis tools that can operate on that archived data.

D3 research is a process of discovery. It is important to document the process, and present it in the right way. Results from multiple snapshot evaluations should be presented. Each evaluation should build on the previous ones and set the stage for the succeeding ones. Taken together a series of snapshot evaluations can be used to study learning problems, test hypotheses, develop possible solutions, and evaluate the effectiveness of those solutions, in the context of learning systems that are in operational use. Moreover, D3 research is good preparation for people seeking employment in the private sector, where agile development and testing are increasingly the norm.

Conclusions

These studies showed how data-driven development can accelerate the creation of AI-driven learning environments. Cloud-based learning environments make learner data available to an unprecedented degree to drive development and evaluation. Consider how this article has brought together analyses of learner data from nine countries, and the logistical and organizational challenges that would have had to be overcome had the data not resided in the cloud.

Cloud-based data collection can help evaluate the effectiveness of learning environments, test new technologies and algorithms, and inform future development. It accelerates iterative development so that learning environments improve quickly and progressively over time. It supports ongoing evaluation instead of deferring evaluation until a project is finished. In fact, in the D3 approach the project need never be finished. As long as a system is in use and collecting data, there are opportunities to further improve and extend it.

The work described here has only begun to explore the potential of using data to track the progress of learning over time. Analytics derived from learner data can become an important resource for instructors as well as schools and organizations. Instructors will be able to focus and adapt instruction based on the progress of each learner. This will help make teaching more data-driven and more responsive to individual learner needs.

The survey findings suggest that the Enskill approach has broad application to communication skills. Students and instructors both wanted to see more examples of simulations in business or professional contexts. We see potential for the application of Enskill to teaching soft skills.

For those considering the application of the D3 methodology to their AI-driven learning environments, here are some final recommendations. Look for sources of learner data early, and host the environment in the cloud as early as feasible. Set up infrastructure for archiving data, and for using data in both online and offline testing. Architect the system so that it supports evaluation as well as content delivery, and can integrate with other cloud-based services. Keep on the lookout for new AI-based services that can augment existing capabilities. But above all, treat learner data as the valuable resource that it is, and as a key to ensuring the environment is effective and successful.