Data-Driven Development and Evaluation of Enskill English

W. Lewis Johnson ORCID: orcid.org/0000-0002-9113-2417¹

2562 Accesses
18 Citations
1 Altmetric
Explore all metrics

Abstract

Cloud computing offers developers of learning environments access to unprecedented amounts of learner data. This makes possible data-driven development (D³) of learning environments. In the D³ approach the learning environment is a data collection tool as well a learning tool. It continually collects data from interactions with learners, which is used in ongoing evaluation and iterative development. Iterative development cycles become very rapid, limited by the time required to analyze data and deploy system updates. D³ is particularly relevant to fielded AIED systems that operate in uncontrolled conditions, where learners may behave in unexpected ways. This article presents two snapshot case studies in the data-driven development of Enskill^® English, a system for learning to speak English as a foreign language. In the first trial at the University of Novi Sad in Serbia two versions of Enskill English’s dialogue system were tested simultaneously: the released version and a new version incorporating statistical natural language processing technology. A new version was released and data were collected from a second snapshot evaluation at the University of Split, Croatia. Data from learners in Latin America and Europe were analyzed for comparison. The evaluations provided preliminary evidence that Enskill English is helpful for learning spoken English skills, and leading indicators that learner performance improves through practice with Enskill English. They suggest that Enskill English can be extended to meet the needs of more advanced learners who wish to use English in a professional context. Broader recommendations for data-driven development of intelligent learning environments are presented.

A Large-Scale, Open-Domain, Mixed-Interface Dialogue-Based ITS for STEM

Applying Artificial Intelligence in the E-Learning Field: Review Article

Validating Learning Outcomes of an E-Learning System Using NLP

Discover the latest articles, news and stories from top researchers in related subjects.

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

For artificial intelligence to enter the mainstream of educational technology and become a standard element of learning environments, design methodologies are needed that enable instructional designers to efficiently create AI-driven (artificial intelligence-driven) learning products. The most commonly used instructional design methodology, and the one which most instructional designers are trained to use, is ADDIE (analyze, design, develop, implement, and evaluate) (Branson et al. 1975; Branch 2009). ADDIE is used on many projects because it is relatively easy to manage and yields predictable results. However, ADDIE and other commonly used instructional design methodologies do not address the needs and potential of AI-driven learning systems that rely heavily on data.

In ADDIE’s original conception the evaluation, or E phase, occurs mainly at the end, after the implementation is complete. Nowadays ADDIE practitioners try to start formative evaluations earlier in the process, however it can be difficult to collect formative evaluation data under conditions that are similar to actual use. Formative evaluations are often conducted in the laboratory with individuals or focus groups (Treser 2015). In our experience at Alelo, customer representatives who are not the intended users frequently test systems. Once a system is deployed on the customer’s own computing infrastructure the designers may have little or no access to data from the system in use.

Cloud computing can fundamentally change the role of learner data in instructional design. As learners access learning systems they provide a constant source of data both before and after system delivery. Cloud-based systems scale up easily, capturing unprecedented amounts of learner data. AI-driven learning environments are well positioned to take advantage of this data source.

This article describes an approach to the design and evaluation of intelligent learning environments, known as data-driven development (D³). In the D³ approach learner data inform design from the very beginning and throughout the product life cycle. Evaluation takes place continually. Development iterates rapidly, limited by the time it takes to analyze and interpret data, retrain models, and deploy updates. D³ is particularly relevant to fielded AIED systems, where learners may behave in unanticipated ways due to the environment in which they learn.

The article then presents two snapshot evaluations of Enskill^® English, an AI-driven system for learning to speak English as a foreign language. Students at the University of Novi Sad in Serbia and the University of Split in Croatia participated in the evaluations. The collected data were used in multiple ways, in what would be distinct phases in the ADDIE model but which are combined in the D³ approach. Data from learners in Latin America and Europe were analyzed for comparison. The evaluations provided preliminary evidence that Enskill English is helpful for learning spoken English skills, and leading indicators that practice with Enskill English results in improved speaking skills. The article then provides broader recommendations for the use of the D³ methodology.

The Data-Driven Development Methodology

The D³ model arose out of the recognition that constant access to data was transforming our approach to instructional design, so that it no longer conformed to the conventional ADDIE model.

Instructional design in D³ is informed by learner behavior data throughout the product life cycle. Available learner data are captured, inspected, and mined as early as possible in the process, preferably in the initial needs analysis phase before development begins. This helps to identify specific learner behavior characteristics that should be addressed by the product. Data mining and analysis confirms hypotheses about learner needs derived from SME (subject matter expert) interviews, identifies additional needs which the SME interviews failed to uncover, and helps prioritize development based on the frequency and severity of learner problems identified in the data.

This begs the question of how to obtain learner data before a learning system is developed. One way is to use archived corpora collected by other systems from learners similar to the target learner profile. Organizations such as the PSLC Datashop (PSLC DataShop 2012) and the Linguistic Data Consortium (Linguistic Data Consortium 2018) have corpora that can be used to inform instructional design. Companies record customer conversations for training purposes. Developers of learning systems that capture learner data sometimes make them available in anonymized form for research purposes. As more learning systems run in the cloud more sources of learner data may become available that can be used to kickstart the development of new learning systems.

Another approach is to build an initial partial prototype or Wizard-of-Oz mockup and use it to gather data to kickstart the design process. This was the approach used to initiate the development of Enskill English. An initial version was deployed to collect data from testers and trial users. We now use Enskill English to collect data from new groups of users, which informs further development.

Figure 1 illustrates the D³ development cycle. It consists of three activities: Data mining and analysis, Development and model updates, and Deployment. In the Data mining and analysis phase, data is analyzed to gain insight regarding learner needs. The Development phase implements or updates the learning system, informed by these insights. It often uses machine learning to train models that are used in the running system. Then in the Deployment phase learners use the updated system, which generates more learner data so that the cycle can continue.

Once the system is running data are collected and analyzed continually to identify system components that need updating, and to obtain training data necessary to perform the updates. In the case of Enskill English this analysis takes place on a weekly basis. As we extend Enskill English to new types of learners we first collect data from those learners to understand how the needs of those learners differ from those of learners who have used the system in the past.

D³ is in some ways similar to the SAM (Successive Approximation Model) method of instructional design (Allen 2012) in that it applies principles of agile software development (Beck et al. 2001), is highly iterative, and values working software at each iteration. Other methodologies for intelligent tutoring system development, such as the Design Implementation Framework (Stone et al. 2018), are iterative. D³ goes a step farther by valuing not just working software but also data at each iteration. It requires a rethinking of the role of data in the system lifecycle. In the past if systems generated data at all they were considered a byproduct and discarded. In the D³ approach data instead become an important input, driver, and result of system development.

D³ supports iterative educational research in real educational settings, similar to educational design research (McKenney and Reeves 2018). The data collected in each iterative development cycle is used to test hypotheses about the learning solution, and develop new hypotheses for testing in subsequent iterative cycles.

Some iterative, data-driven development efforts have been described in the literature. Amershi and Conati (2011) describe a two-stage data development process in which learner data are first analyzed off line to identify clusters of students, and then those clusters are used online to classify learners. Fournier-Viger et al. (2011) describe a two-stage process in which learner interaction data is mined for patterns, and then the tutoring system recognizes those patterns in the interactions of other students. Koedinger et al. (2013) describe data-driven optimization of intelligent tutoring systems but not iterative evaluation of successive versions with learners. Yet projects that engage in educational data mining (EDM) more typically report results from the completed system and do not describe how data mining informed iterative development either before or after deployment. Describing the process is valuable so that practitioners can better understand how to implement the process and transition to a more data-driven methodology.

Overview of Enskill English

Enskill English illustrates how development and evaluation takes place in the D³ development cycle. It is an immersive learning environment for developing spoken proficiency in English as a foreign language. It runs on the Alelo Enskill learning platform, a cloud-based AI-driven platform for developing communication skills. Enskill English is a finalist for the British Council’s 2019 award for digital innovation in English language teaching (British Council 2019).

Of the four main language skills—speaking, listening, reading, and writing—speaking is by far the most important (Thiriau 2017). Yet English language learners (ELLs) around the world struggle to develop their speaking and listening skills. They have limited opportunity to practice with native English speakers, and must practice with their English teacher or with other learners. The problem is particularly acute for large classes where speaking-practice activities are hard to organize and manage.

Enskill English gives learners opportunities for realistic spoken English practice with animated characters that speak and understand English (Fig. 2). They can practice as much as they want in a safe environment, which builds proficiency and self-confidence. It also can reduce anxiety about speaking, which affects foreign language achievement and student attrition (Bailey et al. 2003; Hsieh 2008).

Enskill English’s AI serves as an aide intelligente, not as an artificial instructor. It is intended to act as an intelligent teacher’s aide that gives each learner individual attention while saving teachers time and effort so that they can teach more effectively. It automatically provides feedback and personalized instruction in the areas where learners need to improve. In future versions teachers will receive analytics about their students’ progress, effort, and performance, based in part on the analytics presented in this article, so that they can adapt their instruction to the needs of their students.

Learners access Enskill English through a web browser on their computer or mobile device. They converse with interactive characters by speaking into a microphone. The on-screen character interprets the learner’s speech and responds, and at the same time evaluates the learner’s communication skills. At the end of the exercise Enskill provides feedback and recommends exercises for further practice.

Enskill English is structured as a collection of task-based simulation modules. In each simulation the learner has a task to perform, such as buying a train ticket or getting directions to a destination. Simulations are organized into proficiency levels. The Common European Framework of Reference (CEFR) defines three levels for language proficiency: A (basic user), B (independent user) and C (proficient user). Each level is further subdivided as follows: A1, A2, B1, B2, C1, and C2. At the time of this writing Enskill English simulations are available for A1 and A2 levels, with more under development. Each level covers a semester-length course in English as a foreign language.

Each simulation is aligned with one or more CEFR can-do statements (ALTE 2002). One can-do statement at the A2 level is “can ask for basic information about travel and buy tickets.” In the corresponding Enskill English simulation the learner asks for information about trains and purchases a ticket. When learners master a simulation it shows they really can do what the can-do statement requires. Previously can-do statements served mainly as subjective self-assessments (ACTFL 2015).

Figure 3 shows an Enskill English simulation in which the learner’s task is to buy a train ticket to New York. The ticket agent has asked the learner where he is traveling to, and so the learner should tell her where he wants to go. Communicative functions such as telling someone where they want to go can be expressed in many different ways. Enskill provides each learner the minimum prompting necessary, and utilizes natural language understanding so that it can interpret a wide range of responses. This contrasts with other language learning products (e.g., Rosetta Stone, DynEd, Babbel, etc.) where learners select from a list of pre-authored choices and either read them aloud or click on them. If the learner needs help Enskill will offer some suggested responses (Fig. 3 top right), but the learner is not limited to these options. The transcript and hints are displayed in this example for the purpose of illustration; learners are encouraged to practice without reliance on hints and transcripts.

Unlike chatbot applications designed for native speakers, Enskill English is designed to be usable by both language learners and language teachers. It must support both because if it does not perform well for language teachers they will not recommend it to their students. To support language learners it must be highly tolerant of language errors in the context of conversation. Figure 4 shows an utterance with multiple errors, transcribed as “I want to leave at first my.” The learner has used the preposition “at” instead of “on”, and meant to say “1st May” instead of “May 1st”. The misrecognition of the world “May” as “my” is likely due to a mispronunciation of the “ay” vowel in the word “May” as /ai/. Enskill incorporates a library of common word recognition errors which aids in understanding.

Another key difference from generic chatbot technology is the ability to assess the learner’s language skills and provide feedback and personalized instruction. After each simulation Enskill gives feedback about which task objectives the learner failed to satisfy. The feedback is provided after the dialogue has ended to avoid interrupting the conversation’s flow. Learners then have the option of trying the simulation again or completing practice exercises that focus on the language skills required to complete the unsatisified objectives. Each level has a bank of exercises from which to choose.

Figure 5 shows one such exercise where the learner practices asking about train arrival times. This type of exercise is called an utterance formation exercise; the learner is prompted to produce a spoken utterance that conveys a particular meaning or intent. In these exercises there is typically no one right answer. For example, instead of saying “What time will the train arrive?” the learner can also say “When will the train arrive?” Several equivalent responses are accepted, but the learner’s response must match one of the expected responses exactly.

The Enskill English instructional approach reflects the characteristic of spoken proficiency as both a motor skill (the ability to pronounce words) and a complex cognitive skill (the ability to understand and produce language). According to Fitts and Posner (1967), motor skill learning progresses through stages from the cognitive stage through the associative stage to the autonomous stage. Anderson (1982) generalized this to acquisition of cognitive skill, and described a process of proceduralization and knowledge compilation. These learning processes require practice.

The learning methodology in Enskill English is influenced by the instructional design methodology of van Merriënboer (1997), who argues that complex cognitive skills are best learned through a combination of whole-task practice and part-task practice. Dialogue simulations provide whole-task practice, while practice exercises provide part-task practice.

Enskill English contrasts with common apps such as Babbel, Rosetta Stone and Duolingo which offer some opportunities to practice words and phrases but few opportunities to practice speaking, to construct and produce their own responses without relying on prompts. Supiki (http://linguacomm.com), a mobile phone app that supports simulated conversations, accepts nonsense inputs and does not provide feedback.

A number of computer-aided language learning systems focus on pronunciation feedback, for example see (Engwall 2012). Learners are presented with sentence to read and are scored on their pronunciation. This focuses on language as a motor skill and overlook the key cognitive skills of listening comprehension and sentence production. Learners who focus on pronunciation but neglect these other skills will likely struggle in real-world conversation.

Enskill builds on previous work on immersive training systems for languages and cultures, including Tactical Language (Johnson 2010; Johnson et al. 2012) and the Virtual Cultural Awareness Trainers (VCATs) (Johnson et al. 2011). Evaluations of these systems (MCCLL 2008; Johnson 2015) have indicated that they help learners acquire skills that they apply when they engage with people in other countries. The evaluations of Enskill English in this article provide preliminary indications of that it too is effective, although further investigation of effectiveness is still necessary.

Enskill also builds on research in conversation-based sustems such as Subarashii (Bernstein et al. 1999), Mercurial (Xu and Seneff 2011), PONY (Lee et al. 2014), and HALEF (Evanini et al. 2017). These are promising prototypes but have not been applied at scale. Many previous systems (e.g., Subarashii, Tactical Language, PONY) match learner utterances against a library of pre-authored phrases, a technique that works well for basic language spoken by beginner learners but does not extend well the complex language typical of higher levels of language proficiency. A major goal of Enskill is to understand and respond to complex learner language, up to the CEFR B level. Some systems (e.g., Mercurial) use speech and language technology that is trained on native speakers; this can result in high word error rates on learner speech. Enskill’s speech and language technology is trained on native speakers but adapted for use by language learners. Previous systems have been tested on a small number of sample tasks; Enskill has been applied to entire course curricula.

Dialogue systems such as ITSPOKE (Littman and Silliman 2004), AutoTutor (Graesser et al. 2005), and BEETLE (Dzikovksa et al. 2011) are becoming increasingly common in educational applications, and toolkits such as Alexa Skills Kit make it ever easier to add dialogue capabilities to applications. But these lack the assessment and feedback capabilites that are essential for learning communication skills.

The Enskill Platform

Enskill English runs on the Enskill cloud-based learning platform for communication skills. Enskill is a multilingual platform configurable for a variety of native languages and target languages. It was recently evaluated with American learners of Modern Standard Arabic at the Army language school at Ft. Bragg, NC (Johnson et al. 2018). The Enskill DLE is used in Alelo’s Virtual Cultural Awareness Trainers (VCATs) (Johnson et al. 2011), immersive cultural awareness trainers that are available for over 90 countries and have been used by over 200,000 learners to date.

Figure 6 shows the Enskill system architecture. The core of the system is the Enskill SimServer, which is responsible for delivering simulation content to learners, providing computational support to run them, collecting performance data, and generating analytics. Enskill SimServers are currently running in North America, South America, Europe, and Southeast Asia to support learners throughout the world. Additional servers can be added as needed to serve learners in specific countries. The Enskill SimServer can interoperate with learning management systems (LMS) and other digital learning products. It supports the Learning Tools Interoperability (LTI) standard; support for the Caliper is in development. Enskill supports single sign-on so that learners can log into their institution’s LMS and access Enskill simulations interleaved with other learning activities.

The Enskill DLE (Digital Learning Environment) is an HTML5-compatible runtime player that runs learning content in the learner’s Web browser and captures learner speech and interaction data. It is designed to run on any device and Web browser that supports microphone input, including Windows laptops, Firefox and Chrome on MacBooks, and Chrome on Android devices. We are also developing an immersive 3D interface in Unity for virtual-reality environments.

Authors create content for Enskill using the Enskill Builder. The current version of the Enskill Builder (Johnson and Valente 2008) is only available internally to Alelo personnel. Alelo intends to release a licensed version that enables customers to collaborate with Alelo to develop content, or develop content on their own.

The transition of Alelo technology to the cloud has accelerated innovation and made data-driven development possible. Earlier Alelo systems were delivered as self-contained learning products hosted on client computers and networks. This made access to learner data very difficult. Clients would not request updates so the systems tended to fall out of date. In contrast, the Enskill SimServer processes all learner speech data and automatically archives them for future analysis. When updates to Enskill are released into the cloud, learners get access to them immediately.

The Enskill SimServer uses cloud-based commercial speech and language processing services (sometimes referred to as cognitive services) to assist in the processing of learner responses. This allows to take advantage of the continuing improvements in such services. We adapt and extend these, using the data that we collect from language learners. When learners with heavily accented speech use Enskill, the speech recognizer often mis-recognizes some words. For example many ELLs have trouble pronouncing English short vowels, so the speech recognizer can mis-recognize the word “cups” as “cops”. To compensate for this we have used the data we collect to create a substitution table of common mistranscriptions of learner language, which a partial matching algorithm uses to find the best interpretation of the learner’s utterance. For example, if the speech recognizer transcribes a learner’s utterance as “I want a leaf on May 1st”, the substitution table identifies “leaf” as a possible mispronunciation of “leave”, and so Enskill can understand that the learner meant to say that she wants to leave on May 1st. This substitution table continues to improve as Enskill English collects more examples of mispronunciations. This approach has the advantage that it can be applied in the context of realistic dialogue; unlike computer-aided pronunciation training algorithms (e.g., Arora et al. 2017) it does not require learners to read text prompts off the screen.

The partial matching algorithm described above is an example of the data-driven development (D³) methodology at work. Data informed its initial design, and data informs its iterative development as well as its eventual replacement by new natural language understanding (NLU) algorithms.

Continuous Evaluation in D³

Although the learner data that D³ collects facilitates evaluation, rapid iterative development can make evaluation a challenge. The system being evaluated is a moving target. It is important for evaluations to provide actionable findings quickly, to inform development. D³ addresses this issue through a process of continous evaluation.

D³ utilizes multiple types of evaluations. In what follows tests refer to evaluations that analyze only learner data, while evaluations in general may analyze other data sources such as surveys and interviews in addition to learner data. Tests rely heavily on automated data analysis, while evaluations in general may employ mixed-methods approaches.

At the lowest level in the evaluation process is a series of instant tests: evaluations of learner and system performance at the current point in time. For Enskill English instant tests are performed weekly, analyzing data that was captured over the preceding week. Instant tests are limited to analyses that can be performed automatically, supplemented by spot inspections of the incoming data. Analysis activities such as rating student answers, transcribing recordings, and tagging data require effort, coordination, and time, so it is difficult to make use of them in instant tests. Instant tests can however identify data sets that warrant follow-on analysis.

Additionally, periodic snapshot evaluations are performed over a limited period of time with a selected learner population. Data and analytics from instant tests of the target population are aggregated and further analyzed. Feedback and survey data from students and instructors provide further insights about the strengths and weaknesses of the system. We then update the system and its underlying models. Data analysis generates new hypotheses. Subsequent instant tests and snapshot evaluations determine whether the system changes had the intended effects, and test the new hypotheses. Note that while instant tests can be performed at any time, snapshot evaluations must be scheduled in coordination with instructors and are subject to the constraints of the academic calendar.

Because all learner data are archived it is also possible to test multiple system versions on the same learner data sets. We call these tests regression tests. They help ensure that the observed differences in performance are due to system changes and not differences between learner populations.

Finally, D³ can make use of A/B evaluations. In A/B evaluations two different versions of the system are tested simultaneously with different sets of users (Kohavi and Longbotham 2017). Instant tests and snapshot evaluations can both in principle be conducted in A/B fashion, but since they run for a short period of time they can yield small sample sizes that make it difficult to identify statistically significant differences between A and B.

The snapshot evaluations described in this article were conducted in 2018. The first evaluation was conducted with a group of students in English for special purposes at the University of Novi Sad in Serbia. It utilized learner data collected by Enskill as well as survey data from the learners. The analyses were compared against instant tests of another set of learners in the Laureate International Network. Then in late 2018 another snapshot evaluation of the improved system was conducted with students at the University of Split in Croatia. Results were compared against instant tests with learners at the Universidad Privada del Norte in Peru. This evaluation tested the effects of previous changes that were made to the system and informed subsequent development.

The rapid evaluations in D³ necessitate tradeoffs between speed and thoroughness of analysis. Large data sets take time to accumulate, so D³ evaluations must make do with smaller data sets. High-quality analyses with human raters and transcribers take time and so can be integrated only to a limited extent. As better tools are developed, more of the analysis can be automated and moved into the instant tests and the time to complete the snapshot evaluations is reduced.

The choice of evaluation metrics also evolves over time, as our understanding of learner behavior and system performance improves. The portrayal of D³ in this article, with similar metrics applied at each snapshot, is a simplification for the purpose of exposition. The reality is that everything is evolving—the learning system, the analysis tools, and the evaluation metrics. Metrics that were useful in the earlier snapshots have been discarded and replaced by new and better ones.

The snapshot evaluations described in this article were conducted with learner populations that are different from what Enskill English was originally designed for. It was designed for beginner learners but was tested with intermediate learners. This may seem surprising—why design a learning system for one learner population and test it on a different learner population? The reason is that it enables us to iteratively extend Enskill English to these new learner populations. This illustrates the dual roles of evaluation in D³: to evaluate the current version of the system and to collect data to inform development of future versions of the system.

The current learning levels in Enskill English take a full semester a complete. That is too long a period of time for a snapshot evaluation. So instead we look for leading indicators of learning, preliminary evidence that learning is taking place. If intermediate summative assessments are integrated into the curriculum the results of those assessments can be used. Long-term effectiveness evaluations come later, as data from long-term use become available.

Snapshot Evaluation 1 at the University of Novi Sad

The first snapshot evaluation of Enskill English to be discussed here took place in April and May of 2018. The study population was a class of 80 students in an English for information technology purposes (EiT) program at the Faculty of Technical Sciences of the University of Novi Sad in Serbia. These students had passed an English placement test and were able to understand English at the CEFR B1 (i.e., intermediate) level. One was a native of Bosnia and Herzegovina, the rest were natives of Serbia. All were native Serbian speakers.

At the time of this evaluation the CEFR A1 level of Enskill English was in use by beginner learners and the A2 level was nearing completion. We were interested in extending Enskill to the B level, but we did not yet have data from B-level learners. We did not know whether B-level learners would find the A-level simulations useful, and where they might fall short. However, the director of the English program at the University of Novi Sad, Ms. Vesna Bulatović, indicated that students had limited opportunities to practice speaking English, so we hypothesized that the existing A1 and A2 level simulations might be beneficial for them. Ms. Bulatović cooperated with Alelo in collecting student surveys and ensuring that the students participated in the evaluation.

The evaluation objectives of the study were to test the following hypotheses:

Hypothesis 1. ELLs at the CEFR B level consider Enskill English to be a good way to practice English. They find it useful, fun, and easy to use.
Hypothesis 2. B-level ELLs benefit from practice with Enskill English A-level simulations.

As of the time of this study 1800 A-level English learners and their instructors had already used Enskill English, and the responses had been positive so far. So we expected this study to support Hypothesis 1. The status of Hypothesis 2 was much less certain, since the A-level simulations were not designed for B-level learners. The following are possible reasons why Hypothesis 2 might be true:

Learners can use simulations for review, to maintain mastery of language skills covered earlier in the language course.
They can practice vocabulary and structures common to real-world situations.

The following are some possible reasons why the study might not confirm Hypothesis 2:

B-level learners might regard the A-level conversations to be too easy for them.
B-level learners might use complex language that the A-level simulations could not handle.

The data collection and system testing objectives of the study were as follows:

To collect data from native Serbian speakers, in case the speech and language models need to be updated and retrained for Serbian speakers;
To collect data from B-level ELLs to inform future development; and
To test a new version of the Enskill dialogue system and compare it against the current version.

The remainder of this section gives an overview of the study design and then summarizes the survey findings. It then analyzes learner performance during the trial, as inferred from the learner data. It summarizes the characteristics of the speech data collected from the B-level Serbian speakers. Finally, it presents results from tests of the two versions of the dialogue system, starting with instant tests and then continuing with an in-depth analysis to interpret the instant test results and compare system performance. This is an illustration of the process of data-driven design research in D³.

Study Design

The study materials included two simulation modules at the A1 level and two simulation modules at the A2 level. The A1 simulations had been previously released; the A2 simulations were still beta versions that were in the process of final testing. The study lasted three weeks. During the study period the learners were directed to practice each simulation as a homework assignment until they could complete all the objectives. They were free to practice the simulations more times if they wished.

After the learners completed the trial they completed a survey of their attitudes toward Enskill English and the simulations. The survey included 5-point Likert-scale questions and free-form questions in which learners could describe what they liked about Enskill English and what they would like to see improved. The survey included a net promoter score question to determine whether they would recommend Enskill English to their family and friends.

Enskill kept track of the number of times each learner completed each simulation and whether the learner successfully completed all the task objectives. It also recorded the date and time of each learner’s first and last task completion, as well as the total time spent. We provided these statistics to the instructor so she could track the learners’ usage of the courseware.

During the trial the partial-matching NLU drove the on-screen character behavior. The substitution table included errors from speakers of several languages (Portuguese, Spanish, Thai, and Turkish), but none from Serbian speakers. A new classification-based NLU engine ran in the background on the same learner utterances. The classification-based NLU utilized Microsoft’s LUIS (Language Understanding Intelligent Service) package. Like most commercial providers of cognitive services, Microsoft does not disclose which particular machine learning method it uses in LUIS.

Survey Results

A total of 72 learners completed the completed the survey. 71 attempted the A1 simulations and 66 attempted the A2 simulations.

The net promoter score (NPS) from the survey was 10. An NPS above zero is considered good. There were 27 promoters (score of 9 or 10), 28 passives (score of 7 or 8), and 17 detractors (score of 0 to 6). NPS is calculated as (promoters – detractors) / total * 100.

Most learners agreed with the statement “Enskill English exercises are a good way to practice speaking English” (mean = 4.03 of a possible 5, s.d. = 0.69). Only one learner disagreed, and no learner strongly disagreed. Most agreed with the statement “Enskill English is easy to use” (mean = 4.06, s.d. = 0.85). Two learners disagreed and two learners strongly disagreed. These findings support Hypothesis 1: ELLs in this study considered Enskill English to be a good way to practice English.

Most learners agreed with the statement “The interactive conversations helped me with my English speaking and listening skills” (mean = 3.49, s.d. = 1.09). 8 learners disagreed and 5 learners strongly disagreed. This is consistent with Hypothesis 2 for most learners, under the assumption that if learners felt that the simulations helped them, they probably did help them. Conclusions from such self-reports should be considered tentative, although self-assessments are common practice in second language learning. Rubrics of self-assessed can-do statements are frequently used to evaluate spoken proficiency (ACTFL 2015), because there is currently no practical alternative. Teachers do not have time to assess each student through one-on-one conversation. Objective proficiency tests require scoring by trained human raters, and do not measure spoken competencies at a granular level.

Some learners commented that the dialogue choices were too limited and restrictive. This is not surprising because the NLU pipeline was designed to understand the relatively simple language of learners at the CEFR A level. Some learners also encountered bugs in the A2 beta simulations.

Most learners agreed with the statement “The interactive conversations are amazing” (mean = 3.83, s.d. = 0.91). Nine learners disagreed, but no learner strongly disagreed. This wording was chosen to get a sense of the learners’ subjective experience of practicing with AI-driven characters.

Here are some comments from the learners about what they liked about Enskill English:

It is very imteresting [sic] and it can help you to aprove [sic] your communication skills.
It is very useful for practice.
It’s very easy to use and you can learn a lot.
The variety of the conversations.
There are very interesting conversation [sic] and I really liked it.

Here are some comments from the learners about what they would like to see improved:

Add more variety of answers and possible questions.
I would add more themes for conversation.
Find the right activity for you, improve your English writing skills, improve your English reading skills.
Maybe some more professional conversation.
I don’t know, everything is fine.

Learner Performance Analysis

Of the learners who participated in the trial, the total time spent using Enskill English varied greatly (mean = 32:31, s.d. = 28:15, min = 00:24, max = 2:41:37). The variation was due mainly to the number of times the learners practiced the simulations. 80% of the learners practiced at least one simulation multiple times. Table 1 summarizes the simulation activity of samples of 16 learners on the CEFR A1 simulations and 16 learners on the CEFR A2 simulations. This data set includes 108 simulation trials, 54 of the CEFR A1 simulations and 54 of the CEFR A2 simulations. In 18 A1 trials and 18 A2 trials a learner tried a simulation just once; in the rest of cases the learners tried simulations multiple times.

Table 1 Summary of simulation activity for a sample of learners

Data-Driven Development and Evaluation of Enskill English

Abstract

Similar content being viewed by others

A Large-Scale, Open-Domain, Mixed-Interface Dialogue-Based ITS for STEM

Applying Artificial Intelligence in the E-Learning Field: Review Article

Validating Learning Outcomes of an E-Learning System Using NLP

Explore related subjects

Introduction

The Data-Driven Development Methodology

Overview of Enskill English

The Enskill Platform

Continuous Evaluation in D3

Snapshot Evaluation 1 at the University of Novi Sad

Study Design

Survey Results

Learner Performance Analysis

Data Collection Results

Instant Test Results

In-Depth Analysis of Dialogue-System Performance

Snapshot Evaluation 2 at the University of Split

General Observations Regarding the Collected Data

Survey Results

Instant Test with Instructors

Learner Performance Analysis

NLU Test Results

Discussion

Upcoming Iterations and Other Future Work

Recommendations for D3

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation

Continuous Evaluation in D³

Recommendations for D³