Introduction

Virtual Learning Environments (VLE) play a crucial role in education. For instance, they enable managing educational materials and deploying assessments (Kocadere & Çağlar, 2015; Pereira et al., 2021), which is important for successful learning experiences (Batsell Jr et al., 2017; Mpungose, 2020; Rowland, 2014). However, educational activities are often not motivating (Palomino et al., 2020; Pintrich, 2003). That is problematic because motivation is positively correlated with learning (Hanus & Fox, 2015; Rodrigues et al., 2021). Consequently, the lack of motivation jeopardizes learning experiences.

Empirical evidence demonstrates gamification might improve motivational outcomes (Sailer & Homner, 2020). However, that effect varies from person to person and from context to context (Hallifax et al., 2019; Hamari et al., 2014). To mitigate such variations, researchers are exploring personalized gamification, especially for educational purposes (Klock et al., 2020). By definition, personalization of gamification is having designers (or the system automatically) tailor the gamification design to different users/contexts, instead of presenting the same design to all (i.e., the one-size-fits-all (OSFA) approach) (Tondello, 2019). In practice, that is often done by offering different game elements to distinct users (Hallifax et al., 2019), thus acknowledging that people have different preferences and are motivated differently (Altmeyer et al., 2019; Tondello et al., 2017; Van Houdt et al., 2020).

Although personalized gamification has been widely researched, the understanding of how it compares to the OSFA approach is limited. Initial empirical evidence suggests personalized gamification can outperform the OSFA approach within the social network and health domains (Hajarian et al., 2019; Lopez & Tucker, 2021). However, results within the educational domain are mostly inconclusive (Rodrigues et al., 2020). Personalizing gamification to a single dimension might explain such inconclusive findings (Mora et al., 2018; Oliveira et al., 2020), given that recent research highlighted the need to personalize to multiple dimensions simultaneously (Klock et al., 2020). While preliminary evidence supports multidimensional personalization’s potential, empirical evidence is limited by either not comparing it to the OSFA approach (Stuart et al., 2020) or low external validity (Rodrigues et al., 2021).

In light of that limitation, our objective is to test the generalization of the effect of multidimensional personalization of gamification as well as to investigate possible moderators of that effect. We accomplish that goal with an experimental study conducted in three institutions, thereby differing from prior research in three directions. First, unlike most studies (e.g., Lavoué et al., 2018; Stuart et al., 2020), our baseline is OSFA gamification, which we implemented with points, badges, and leaderboards (PBL). That is important because PBL is the set of game elements most used in research on gamification applied to education and, overall, has effects comparable to other sets (Bai et al., 2020). Second, we differ from Hajarian et al. (2019) and Lopez and Tucker (2021) in terms of context (i.e., education instead of dating/exercise). That is important because context affects gamification’s effect (Hallifax et al., 2019; Hamari et al., 2014). Third, Rodrigues et al. (2021) studied multidimensional personalization in a single institution based on a confirmatory analysis. Differently, this study involves three institutions and presents exploratory analyses, besides confirmatory ones, to understand variations in multidimensional personalization’s effect. These differences are important to test prior research’s external validity and to advance the field from whether to when/to whom personalization works (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020). Therefore, we contribute new empirical evidence on how multidimensional personalization of gamification, implemented according to a decision tree-based recommender system, affects motivational learning outcomes in the context of real classrooms. Thus, we contribute to the design of gamified learning environments and to the understanding of when and to whom such personalization is more or less suitable.

Background

This section provides background information on VLE and assessments for learning, motivational dimensions, gamification’s effect and sources of its variation, and tailored gamification. Then, it reviews related work.

Virtual Learning Environments and Assessments for Learning

VLE are essential for today’s education. They provide better access to materials and supplementary resources, and facilitate feedback and learning outside class (Dash, 2019; Pereira et al., 2020). They have been especially important during the COVID-19 pandemic, which forced the adoption of remote learning in many countries (Mpungose, 2020). Additionally, instructors still value students completing assignments and assessments (Mpungose, 2020; Pereira et al., 2021), which are only possible through VLE during such times.

A theoretical perspective to understand those activities’ relevance comes from Bloom’s Taxonomy (Bloom, 1956). It classifies educational outcomes, helping instructors define what they expect/intend students to learn. Considering how learning objectives are commonly described, the taxonomy was revised and split into two dimensions: knowledge and cognitive (Krathwohl, 2002). The former concerns knowledge of terminologies, categories, algorithms, and strategies. The latter refers to whether learners are expected to remember, understand, apply, analyze, evaluate, or create. Accordingly, instructors might use assignments to encourage students to apply algorithms to new contexts or provide assessments to help students consolidate terminologies/strategies by remembering them.

From a practical perspective, those activities’ relevance is supported, for instance, by the testing effect: the idea that completing tests (e.g., assessments/quizzes) improves learning (Roediger-III & Karpicke, 2006). Empirical evidence supports that theory, showing that completing tests positively affects learning outcomes in general (Batsell Jr et al., 2017; Rowland, 2014) and in gamified settings (Rodrigues et al., 2021; Sanchez et al., 2020). On the one hand, that is important because most educational materials are not motivating for students (Hanus & Fox, 2015; Palomino et al., 2020; Pintrich, 2003). Hence, gamifying assessments likely improves student motivation to complete a task known to enhance learning. On the other hand, research shows several factors decrease gamification’s effectiveness (i.e., moderators, such as age (Polo-Peña et al., 2020) and being a gamer (Recabarren et al., 2021)), leading to cases wherein gamification ends up negatively affecting learning experiences (Hyrynsalmi et al., 2017; Toda et al., 2018).

Motivational Dimensions

In a literature review, Zainuddin et al. (2020) found that research on gamification applied to education has relied on several theoretical models, such as Flow Theory, Goal-setting Theory, and Cognitive Evaluation Theory. Despite that, the authors found Self-Determination Theory (SDT) is the most used to explain how gamification affects users, which is aligned with its goal of improving motivation (Sailer & Homner, 2020). Therefore, this section provides a brief introduction to SDT, as we also understand motivation according to it.

According to SDT (Deci & Ryan, 2000), one’s motivation to do something (e.g., engage with an educational activity) varies along a continuum from amotivation (i.e., no determination/intention at all) to intrinsic motivation (i.e., an internal drive due to feelings such as enjoyment and pure interest). In between them, one experiences extrinsic motivation, which concerns four regulations: external (e.g., due to rewards), introjected (e.g., due to guilt), identified (e.g., due to personal values), and integrated (e.g., due to values incorporated into oneself). Respectively, those refer to exhibiting a behavior due to external, somewhat external, somewhat internal, and internal drivers.

Within educational contexts, research advocates for internal drivers. Based on SDT (Ryan & Deci, 2017), behaviors driven by avoiding punishment or seeking rewards are likely to disappear once the external drivers are no longer available. In contrast, SDT posits that internal regulations, such as doing something due to personal values or curiosity, are long-lasting. Accordingly, the literature considers that autonomous motivation, which encompasses regulations connected to internal drivers, is ideal for learning (Vansteenkiste et al., 2009).

Gamified Learning: Effects and Moderators

Gamification is the use of game design elements outside games (Deterding et al., 2011). Meta-analyses summarizing the effects of gamification applied to education found positive effects on cognitive, behavioral, and motivational learning outcomes (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020). However, these studies also show limitations of gamification’s effects, which vary due to geographic location, educational subject, and intervention duration. Similarly, experimental studies have found that gamification’s effect ranged from positive to negative within the same sample, depending on the user (Rodrigues et al., 2020; Van Roy & Zaman, 2018). Additionally, empirical evidence shows cases wherein gamification’s impact changed depending on specific moderators, such as gender (Pedro et al., 2015), age (Polo-Peña et al., 2020), and being a gamer (Recabarren et al., 2021).

These moderators are predicted by the Gamification Science framework (Landers et al., 2018). It claims gamification affects motivation; motivation affects behavior; behavior affects cognitive outcomes; and moderators affect each of those connections. Consequently, on the one hand, analyzing behavioral/cognitive outcomes without considering motivational ones is problematic: gamification might improve motivation, but that improved motivation might not lead to the desired behavior. In that case, the problem is not gamification itself, but that the motivation it fostered targeted some other behavior, which cannot be detected by a study limited to analyzing behavior. Therefore, gamification studies must prioritize measuring motivational outcomes aligned with gamification’s goals to prevent misleading conclusions (Landers et al., 2018; Tondello & Nacke, 2020).

On the other hand, the problem might be the gamification design itself (Loughrey & Broin, 2018; Toda et al., 2019). Empirical evidence demonstrates that different users are motivated differently (Tondello et al., 2017) and that gamified designs must be aligned to the tasks wherein they will be used (Hallifax et al., 2019; Rodrigues et al., 2019). Thereby, due to the moderating effect of personal and task-related characteristics, gamification designs should be aligned to the specific task and the users. However, most gamified systems present the same design to all users, regardless of the task they will do (Dichev & Dicheva, 2017; Liu et al., 2017); the OSFA approach. Therefore, the OSFA approach might explain variations in gamification’s outcomes, and cases wherein it only works for some users, given the role of moderating characteristics.

In summary, gamification studies should focus on measuring motivational outcomes known to affect the desired behavioral outcomes (Landers et al., 2018; Tondello & Nacke, 2020). Within the educational domain, for instance, strong empirical evidence supports autonomous motivation’s positive role in learning outcomes (Hanus & Fox, 2015; Rodrigues et al., 2021; Vansteenkiste et al., 2009). Furthermore, ensuring systems provide gamification designs that consider who will use them, and to accomplish which task, is important to address limitations of the OSFA approach (Klock et al., 2020). Based on that context, this article studies personalization as a way to improve OSFA gamification. In doing so, we focus on measuring a single construct (i.e., motivation) to increase the study’s validity (Wohlin et al., 2012), given the evidence supporting motivation’s positive relationship with learning.

Tailored Gamification

Fundamentally, tailoring gamification leads to different gamification designs depending on who will use it and for what (Klock et al., 2020; Rodrigues et al., 2020). Tailoring can be achieved in two ways. First, when users define the design, it is known as customization (Tondello, 2019). Empirical evidence supports customization’s effectiveness compared to the OSFA approach in terms of behavior (Lessel et al., 2017; Tondello & Nacke, 2020). However, there is no evidence clarifying whether improvements are due to increased motivation or to the effort users put into defining their designs (Schubhan et al., 2020). Additionally, customization imposes the burden of making users select their designs for each task, if the literature’s suggestion of matching gamification to the task is followed (Rodrigues et al., 2019; Liu et al., 2017; Hallifax et al., 2019).

Second, when designers or the system itself defines the tailored design, it is known as personalization (Tondello, 2019). In that case, one needs to model users/tasks to understand the gamification design most suitable for each case. Commonly, that is accomplished by gathering user preferences via surveys, then analyzing those to derive recommendations on which game elements to use when (e.g., Tondello et al., 2017). Most recommendations, however, guide personalizing gamification to a single dimension or a few dimensions (Klock et al., 2020), even though several factors affect user preferences (see “Gamified Learning: Effects and Moderators” section). Accordingly, empirical evidence from comparing gamification personalized through such recommendations to the OSFA approach is mostly inconclusive (Rodrigues et al., 2020). In contrast, the few recommendations for multidimensional personalization of gamification (e.g., Baldeón et al., 2016; Bovermann & Bastiaens, 2020) have not been experimentally compared to OSFA gamification, the exception being Rodrigues et al. (2021), whose recommendations were validated in an initial study that yielded promising results.

Summarizing, while customization naturally leads to gamification tailored to multiple user and contextual dimensions, it requires substantial effort from users. Personalization mitigates that burden by using predefined rules to define gamification designs, but those rules are mostly driven by a single user dimension. That is problematic because several user and contextual dimensions affect gamification’s effectiveness. To the best of our knowledge, the only recommendation for multidimensional personalization to be empirically tested was analyzed in a small experimental study, which highlights the need for studies grounding the understanding of whether multidimensional personalization improves on OSFA gamification.

Related Work

Empirical research often compares personalized gamification to random, counter-tailored, or no gamification (Rodrigues et al., 2020). Consequently, those studies do not add to the understanding of whether personalization improves the state of the art: well-designed, OSFA gamification (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020). Therefore, we limit our related work review to experimental studies comparing personalized and OSFA gamification, considering those studies provide reliable evidence to understand personalization’s contribution to practice. To the best of our knowledge, five studies meet those criteria; they were found by screening recent literature reviews (Hallifax et al., 2019; Klock et al., 2020; Rodrigues et al., 2020) and through ad hoc searches for recent studies not included in those reviews. They are summarized in Table 1, which shows that related work mostly personalized gamification to a single user dimension (e.g., the HEXAD typology (Lopez & Tucker, 2021; Mora et al., 2018)) and that research using multidimensional personalization either applied it to non-educational ends (Hajarian et al., 2019) or has limited external validity (Rodrigues et al., 2021). This study addresses that gap with an experimental study conducted in three institutions, comparing the OSFA approach to gamification personalized to multiple user and contextual characteristics. Thus, this study differs from Hajarian et al. (2019), Lopez and Tucker (2021), Mora et al. (2018), Oliveira et al. (2020), and Rodrigues et al. (2021) in terms of domain, personalization dimensionality, and external validity.

Table 1 Related work compared to this study in terms of the personalization strategy and the study design

As Rodrigues et al. (2021) is the most similar research, we further discuss how this study differs from it. On the one hand, Rodrigues et al. (2021) conducted an experiment (N = 26) in a single, southwestern Brazilian university. That experiment had two sessions, held on subsequent days, and focused on a confirmatory analysis; that is, identifying whether students’ motivations differed when comparing OSFA and personalized gamification. On the other hand, this study (N = 58) involved three northwestern institutions of the same country. Besides encompassing more institutions, this contextual difference is relevant because Brazil is a continental-sized country; consequently, the realities of the southwestern and northwestern regions differ widely. The northwestern region contains nine of the 12 Brazilian states with literacy rates below the national average, whereas the southwestern region has the highest literacy average in the country (Grin et al., 2021). Additionally, we increased the spacing between sessions from one day to four to six weeks and conducted both confirmatory and exploratory analyses. Considering Rodrigues et al. (2021) found personalization had a positive effect on students’ autonomous motivation, testing whether those findings hold with new students and in other contexts is imperative to ground such results (Cairns, 2019; Seaborn & Fels, 2015). Furthermore, we extend prior research’s contribution by analyzing whether personalization’s effect on student motivation depends on contextual (e.g., subject under study) and user characteristics, such as gender and age (i.e., whether it works for some but not others). This understanding is important because the effectiveness of OSFA gamification is known to depend on such factors (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020). Thus, we expand the literature by testing the external validity of state-of-the-art results with a new, larger sample from institutions of a distinct region, as well as shedding light on when and to whom personalized gamification works.

Apparatus

To enable this study, we designed and deployed learning assessments in a VLE. All assessments featured 30 multiple-choice items with four alternatives each. Items were designed so that students could solve them correctly if they were able to recall information from the lessons. Therefore, the experimental task was limited to the remembering cognitive dimension while intentionally exploring varied knowledge dimensions (Krathwohl, 2002). To ensure the items’ suitability, one researcher developed and revised all assessments under the instructors’ guidance. A sample item, from the Introduction to Computer Programming subject, reads: About operations with strings, indicate the wrong alternative: a) ‘size’ returns the number of characters in a string; b) ‘str’ converts a number to a string; c) ‘replace(a, b)’ creates a string by replacing ‘a’ with ‘b’; d) ‘upper’ turns all characters in the string to uppercase. All assessments are available in the Supplementary Materials.

We deployed the assessments in the gamified system Eagle-Edu because the system developers granted us access to use it for scientific purposes. Eagle-Edu allows creating courses on any subject, which have missions composed of activities, such as multiple-choice items. For this study, all courses featured ten 3-item missions, considering students gave positive feedback about that design in Rodrigues et al. (2021). Missions’ items, as well as the items’ alternatives, appeared in a random order in all courses. Because those were assessments for learning, students could redo items they missed until getting them right. Figure 1 demonstrates the system, which can be seen as an assessment-based VLE. After logging in, the student selects the course to work on from the list accessible through the top-left hamburger menu. Then, they can interact with game elements, such as checking leaderboard positions, or start a mission from the course home page (Fig. 1a and b). In the latter case, students are immersed in one 3-item mission at a time, completing the multiple-choice items individually (Fig. 1). Once a mission is finished, the system returns to the course home page and the usage flow restarts.

Fig. 1
figure 1

Screenshots of Eagle-Edu

In terms of gamification design, this study used the game elements described next, which are based on definitions for educational environments (Toda et al., 2019):

  • Acknowledgment: Badges awarded for achieving mission-related goals (e.g., completing a mission with no error); shown at the course’s main page and screen’s top-right;

  • Chance: Randomly provides users with some benefit (e.g., extra points for completing a mission);

  • Competition: A Leaderboard sorted by performance in the missions completed during the current week that highlights the first and last two students; shown at the course’s main page;

  • Objectives: Provide short-term goals by representing the course’s missions as a skill tree;

  • Points: Numeric feedback that functions similarly to Acknowledgment; shown at the screen’s top-right and within Leaderboards when available;

  • Progression: Progress bars for missions; shown within missions and in the skill tree (when Objectives is on);

  • Social Pressure: Notifications warning that some student of the same course completed a mission;

  • Time Pressure: Timer indicating the time left to climb in the Leaderboard before it resets (week’s end);

Note that each Eagle-Edu course features its own gamification design. Accordingly, for the OSFA condition, we implemented a single course for each educational subject. For the personalized condition, however, we had to create one course for each personalized design. Hence, if the gamification designs of users A and B differed by a single game element, they would be in different Eagle-Edu courses. Regardless of that, students of the same subject always completed the same assessment, and all courses had the same name. For instance, consider students of the Object Oriented Programming subject. Those assigned to the OSFA condition would all be in a single Eagle-Edu course. Those assigned to the Personalized condition could, for instance, be attributed to two different Eagle-Edu courses, which is necessary because people in this condition will often use distinct designs depending on their characteristics (see “Experimental Conditions” section for details). However, all three courses would have the same assessment and name, ensuring that the gamification design was the only difference among conditions, as needed for the experimental manipulation. Nevertheless, that affected the Leaderboards’ appearance (see “Limitations and Future Work” section). The Supplementary Material provides a video and images of the system.

Method

This study involves both confirmatory (i.e., testing assumptions) and exploratory (i.e., generating hypotheses) data analyses (Abt, 1987). Accordingly, and based on our goal, we investigated the following:

  • Hypothesis 1 - H1: Multidimensional personalization of gamification improves autonomous motivation but not external regulation and amotivation, compared to the OSFA approach, in gamified review assessments.

  • Research Question 1 - RQ1: Do user and contextual characteristics moderate the effect of multidimensional personalization of gamification, in gamified review assessments?

  • RQ2: How does the variation of students’ motivations change when comparing gamification personalized to multiple dimensions to the OSFA approach, in gamified review assessments?

  • RQ3: What are students’ perceptions of gamified review assessments?

H1 is derived from Rodrigues et al. (2021), which found such results in a small, single-institution study. Therefore, we aim to test whether those results hold for different users, from other institutions, completing assessments on different subjects. The rationale for this hypothesis is twofold. First, autonomous motivation is considered ideal for learning purposes (Vansteenkiste et al., 2009). Second, although multidimensional personalization of gamification holds potential to improve OSFA gamification, empirical evidence is limited (Rodrigues et al., 2021; Stuart et al., 2020). Thus, testing H1 informs the effectiveness of equipping gamified educational systems with multidimensional personalization to improve autonomous motivation, which is known to mediate improvements in learning outcomes according to empirical evidence (Hanus & Fox, 2015; Rodrigues et al., 2021; Sanchez et al., 2020) and the Gamification Science framework (Landers et al., 2018).

RQ1 is based on research showing user and contextual characteristics moderate gamification’s effect (Hallifax et al., 2019; Huang et al., 2020; Rodrigues et al., 2021; Sailer & Homner, 2020). Thereby, we want to test whether the same happens for personalized gamification and identify which factors are responsible for it. Similarly, RQ2 is based on research showing gamification’s effect varies from user to user and from context to context (Hamari et al., 2014; Rodrigues et al., 2020; Van Roy & Zaman, 2018). Thus, we want to understand whether personalization can adapt to such variation. RQ3 is related to research demonstrating gamification is perceived positively, overall (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020) and within assessments (Rodrigues et al., 2021). While that supports expecting positive results, we frame it as an RQ because we did not predict this result before our data analysis.

Based on those RQs and H1, we designed and conducted a multi-site experimental study following a mixed factorial design: gamification design (levels: OSFA and Personalized) and session (levels: 0 and 1) were the factors. For gamification design, participants were randomly assigned to one of the two conditions (between-subjects) at each trial, while sessions 0 and 1 refer to mid-term and end-term assessments (within-subjects), respectively. Figure 2 summarizes this study, which received ethics committee approval (CAAE: 42598620.0.0000.5464).

Fig. 2
figure 2

Study Overview. Institutions are Federal University of Roraima (UFRR), Federal University of Amazonas (UFAM), and Amazonas State University (UEA). Subjects are Object Oriented Programming (POO), Introduction to Computer Programming (IPC), Programming Language 2 (LP2), Computer and Algorithms Programming (PCA)

Sampling

We relied on convenience sampling. Researchers contacted four fellow instructors, presented the research goals, and proposed applying review assessments to their students in two lessons. All contacted instructors agreed without receiving any compensation. They worked in three institutions (Federal University of Roraima, Federal University of Amazonas, and Amazonas State University) and were responsible for four different subjects (Object Oriented Programming, Introduction to Computer Programming, Programming Language 2, and Computer and Algorithms Programming). Thus, all eligible participants were enrolled in one of the four subjects at one of the three institutions (see Fig. 2). For each trial, we sent the characterization survey about a month before session 0 and asked students to complete it by the weekend before the first session, to enable registering them in the system.

Participants

After the four trials, 151 students had completed the characterization survey. Four of those were excluded because they did not use the system. Another 54 were excluded due to late registration (i.e., completing the characterization form after the deadline), which made random assignment unfeasible. Nevertheless, they participated in the activity with no restriction as it was part of the lesson. Finally, 35 students were excluded because they participated in a single session, leading to our sample of 58 participants: 26 and 32 in the OSFA and Personalized conditions, respectively (see Table 2 for demographics). Students from Federal University of Amazonas and Amazonas State University received points (0.5 or 1%) towards their grades as compensation for participating in each session; we left that choice to the instructors. Additionally, note that performance within the activity did not count towards any student’s grade, to mitigate biases.

Table 2 Participants’ demographic information

Experimental Conditions

We designed two experimental conditions, OSFA and Personalized, which differ in terms of the game elements they present to users. We implemented them by changing the game elements available, as that is the common approach (Hallifax et al., 2019). The OSFA condition featured Points, Badges, and Leaderboards (PBL), similar to Rodrigues et al. (2021). PBL are among the game elements used the most in gamification research and together provide effects comparable to other combinations (Bai et al., 2020; Gari et al., 2018; Venter, 2020; Zainuddin et al., 2020). Therefore, we believe PBL offer external validity as an implementation of standard, OSFA gamification because they are similar to most gamified systems used in practice (Wohlin et al., 2012). Accordingly, we defined that the Personalized condition would have the same number of game elements as the OSFA condition. This ensured the only distinction was which game elements were available, mitigating confounding effects that could emerge from comparing conditions with different numbers of game elements (Landers et al., 2015).

The Personalized condition provided game elements according to recommendations from decision trees built by prior research, following the same procedure as Rodrigues et al. (2021). Such recommendations are based on conditional decision trees (Hothorn et al., 2006) that were generated in two steps. First, the authors used a survey to capture people’s top-three preferred game elements for different learning activity types, along with their demographic information. According to the authors, this survey was distributed through Amazon Mechanical Turk, a crowdsourcing platform widely used to increase the external validity of such approaches (Rodrigues et al., 2021). Second, the authors generated conditional decision trees from these data. In particular, they created three trees, one for each of the top-three spots collected in the survey. Then, following how conditional decision trees are built, the validation process relied on the null hypothesis significance testing framework (Hothorn et al., 2006). Accordingly, the trees were validated based on whether an input significantly affected the outputs’ accuracy, thereby maximizing generalization based on the assumptions underlying inferential statistics (Sheskin, 2003).
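
To make that procedure concrete, the sketch below illustrates how such a tree could be fit and queried in R with the partykit package, which implements the conditional inference framework of Hothorn et al. (2006). This is an illustrative reconstruction rather than the original authors’ code; the data file and column names (e.g., survey_preferences.csv, top1_element) are hypothetical.

  # Illustrative sketch only: fit a conditional inference tree on hypothetical
  # survey data and query it for a recommendation (all names are placeholders).
  library(partykit)

  survey <- read.csv("survey_preferences.csv", stringsAsFactors = TRUE)

  # ctree() selects splits via permutation tests, so a predictor only enters
  # the tree if it significantly affects the outcome (the top-1 game element).
  top1_tree <- ctree(
    top1_element ~ gender + education + weekly_playing_time + preferred_genre +
      preferred_setting + researched_gamification + country + activity_type,
    data = survey
  )

  # Predicted (recommended) game element for a given user/context profile.
  predict(top1_tree, newdata = survey[1, ], type = "response")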

After that validation process, the variables composing the trees’ inputs, which are significant predictors of their outputs, are: the user’s i) gender, ii) highest educational degree, iii) weekly playing time, iv) preferred game genre, v) preferred playing setting, and vi) whether they have already researched gamification (see Table 2), as well as vii) the country where gamification will be used and viii) the learning activity type in terms of the cognitive process exercised while doing it, according to the processes described in the revision of Bloom’s taxonomy (Krathwohl, 2002). Those were input according to the values presented in Table 2 (e.g., age was a numeric value, while preferred playing setting was either singleplayer or multiplayer), aiming to enable personalizing to individual and contextual differences, as proposed in Rodrigues et al. (2021).

Additionally, Rodrigues et al. (2021) discuss that these variables were selected following research demonstrating that demographics, gaming preferences/experience, and contextual information should be considered when designing gamification for the educational domain (Hallifax et al., 2019; Klock et al., 2020; Liu et al., 2017). For instance, take researched gamification. Rodrigues et al. (2021) first asked participants for how many years they had worked with or scientifically researched gamification. Then, because most participants answered they had zero years of experience, the authors coded this variable as yes (more than zero) or no (zero). Despite that, their analyses revealed this binary variable plays a significant role in people’s preferences. Hence, we similarly asked our participants whether they had worked with or scientifically researched gamification, as this acknowledges, for example, the possibility that a student with experience in gamification design might have preferences different from those of people without such experience.

Importantly, country and learning activity type do not appear in Table 2. The reason is that - due to our experimental setting - those were fixed: all participants were Brazilians and the learning activity was limited to the remembering dimension of Bloom’s Taxonomy. Considering that fixed information, we analyzed the decision trees and identified that, for our sample and experimental task, selecting the game elements available to each user only depends on whether one’s preferred game genre is action or not, their gender, and the researched gamification variable. That happened because, given our study’s contextual information, other characteristics either led to the same game elements or were not part of the paths from the trees’ roots to their leaves. Hence, that analysis allowed us to consider only the three user characteristics shown in Table 3, which summarizes the game elements available to each user of the personalized condition according to their information. To exemplify the conditions’ differences, consider a participant who has experience researching gamification and whose preferred game genre is adventure. If this person were assigned to the personalized condition, their gamification design would be the one shown in Fig. 1b. In contrast, the same person would use the design shown in Fig. 1a if assigned to the OSFA condition. For a complete view of how all designs differ, please refer to our Supplementary Material.

Table 3 Game elements used in the Personalized condition according to user characteristics based on recommendations from Rodrigues et al. (2021)

Note that, in some cases, two decision trees (e.g., top-one and top-two) recommended the same game element. In those cases, we selected the next recommended game element to ensure the system presented three game elements to all participants. For instance, if the third tree’s number one recommendation was Objectives, but one of the other trees had already recommended it, we would select the third tree’s number two recommendation. We made that choice to avoid the confounding factor of participants interacting with different numbers of game elements (Landers et al., 2015). Based on that, for each student, the personalization process worked as follows (see the sketch after this paragraph). First, we analyzed their information, particularly the characteristics described in Table 3. Second, we identified which game elements to offer to that student according to their characteristics, following recommendations from the aforementioned decision trees. Finally, we assigned the student to a gamification design that matched their characteristics.
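
As a minimal sketch of that lookup, the function below maps the three characteristics in Table 3 to a set of three game elements. The rules shown are placeholders for illustration only; the actual element sets are those recommended by the decision trees and summarized in Table 3.

  # Illustrative sketch only: the element sets below are placeholders, NOT the
  # actual recommendations reported in Table 3.
  recommend_elements <- function(prefers_action, researched_gamification, gender) {
    if (prefers_action) {
      c("Competition", "Time Pressure", "Points")
    } else if (researched_gamification) {
      c("Objectives", "Progression", "Acknowledgment")
    } else if (gender == "female") {
      c("Progression", "Points", "Chance")
    } else {
      c("Acknowledgment", "Objectives", "Social Pressure")
    }
  }

  # Example: a student who does not prefer action games, has not researched
  # gamification, and reported gender "female".
  recommend_elements(prefers_action = FALSE, researched_gamification = FALSE,
                     gender = "female")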

Measures and Moderators

To measure our dependent variable - motivation - we used the Situational Motivation Scale (SIMS) (Guay et al., 2000). It is aligned with SDT (Deci & Ryan, 2000), has been used in similar research (e.g., Lavoué et al., 2018; Rodrigues et al., 2021), and has a version in the participants’ language (Gamboa et al., 2013). Using the recommended seven-point Likert scale (1: corresponds not at all; 7: corresponds exactly), the SIMS captured motivation to engage with the VLE through four constructs: intrinsic motivation, identified regulation, external regulation, and amotivation (Deci & Ryan, 2000). Each construct was measured by four items, and these items’ average led to the construct’s final score. A sample prompt is Why are you engaged with the system where the activity was made? and a sample item is Because I think that this activity is interesting. Additionally, we provided an open-text field so that participants could comment on their experiences.
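
As a minimal sketch of the scoring (with hypothetical item column names such as im_1), each construct’s score is the mean of its four 7-point items:

  # Illustrative sketch only: the response data frame and its column names
  # are hypothetical; each construct score is the mean of its four items.
  score_sims <- function(responses) {
    data.frame(
      intrinsic_motivation  = rowMeans(responses[, paste0("im_", 1:4)]),
      identified_regulation = rowMeans(responses[, paste0("id_", 1:4)]),
      external_regulation   = rowMeans(responses[, paste0("er_", 1:4)]),
      amotivation           = rowMeans(responses[, paste0("am_", 1:4)])
    )
  }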

The moderator analyses considered the following variables:

  • Age (in years);

  • Gender: male or female;

  • Education: High School, Technical, or Graduated;

  • Preferred game genre: Action, Adventure, RPG, or Strategy;

  • Preferred game setting: Multiplayer or Singleplayer;

  • Weekly playing time (in hours);

  • Performance: the number of errors per assessment;

  • Assessment subject: POO, IPC, LP2, or PCA;

  • Usage interval (in weeks): 0 (first usage), 4, or 6.

In summary, items one to six came from the trees’ inputs, while we considered the last three items due to the experimental design. Note that performance is not considered a dependent variable. Because the experimental task was completing assessments for learning, its effect on participants’ knowledge could only be properly measured after the task. Accordingly, our exploratory analyses inspect performance as a possible moderator of personalization’s effect, based on research showing performance-related measures might moderate gamification’s effect (e.g., Rodrigues et al., 2021; Sanchez et al., 2020). We also analyze user characteristics (i.e., age, gender, education, preferred game genre, preferred game setting, and weekly playing time) as possible moderators because we followed a personalization strategy grounded in research discussing that those might moderate gamification’s effect (Rodrigues et al., 2021). Additionally, we study the role of contextual information (i.e., assessment subject and usage interval) as this study differs from similar work (Rodrigues et al., 2021) in those respects. Thereby, we investigated, within personalized gamification, moderators that have demanded attention in the standard approach. Lastly, note that the moderators’ levels follow those in Rodrigues et al. (2021) and that gender is limited to male and female because those were the options our participants reported.

Procedure

First, participants were invited to participate in the study. Second, they had to complete the characterization survey by the deadline, which captured identifying information plus the information described in the “Experimental Conditions” section. Participants self-reported their age and weekly playing time through numeric, integer fields, and the other information (e.g., preferred game genre) through multiple-choice items, as collected in Rodrigues et al. (2021). Such information and respecting the deadline were essential to enable personalization. Third, around mid-term, participants completed the first session’s assessment and the SIMS. Fourth, towards the term’s end, participants completed the second session’s assessment and, again, the SIMS. One researcher participated in both sessions, providing clarifications as required (e.g., explaining how to use the system). Additionally, at the start of session 0, the researcher presented the study goal, the procedure, and an Eagle-Edu tutorial.

Data Analysis

For the confirmatory analyses (H1), we used the same method as Rodrigues et al. (2021) because we are testing the generalization of their findings. Therefore, we applied robust (i.e., 20% trimmed means) mixed ANOVAs (Wilcox, 2011), which handle unbalanced designs and non-normal data (Cairns, 2019). We do not apply p-value corrections because each ANOVA tests a planned analysis (Armstrong, 2014).
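
For concreteness, the sketch below shows how such a robust mixed ANOVA could be run with the WRS2 package’s bwtrim function, assuming a long-format data frame (one row per participant per session) with hypothetical column names:

  # Illustrative sketch only: motivation_long and its columns are hypothetical.
  library(WRS2)

  # Robust between-within ANOVA on 20% trimmed means; one model per construct.
  bwtrim(intrinsic ~ condition * session, id = participant,
         data = motivation_long, tr = 0.2)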

Our exploratory analyses (RQs) follow recommendations for open science (Dragicevic, 2016; Vornhagen et al., 2020). As suggested, we do not present (nor calculate) p-values because they are often interpreted as conclusive evidence, which is contrary to exploratory analyses’ goal (Abt, 1987). Instead, we limit our analyses to confidence intervals (CIs), which contribute to transparent reporting and mitigate threats of a replication crisis in empirical computer science (Cockburn et al., 2020). To generate reliable CIs for non-normal data and avoid misleading inferences, our exploratory analyses rely on CIs calculated using the bias-corrected and accelerated (BCa) bootstrap, as recommended in Cairns (2019) and Carpenter and Bithell (2000). For categorical variables, we compare participants’ motivations among subgroups, while for continuous variables we run correlations between them and the motivation constructs. For both categorical and continuous variables, we investigate whether one subgroup’s CI overlaps with that of another subgroup to understand their differences. Throughout those analyses, we consider all the moderators described in the “Measures and Moderators” section. Confidence levels are 95% and 90% for confirmatory and exploratory analyses, respectively (Hox et al., 2010; Rodrigues et al., 2021). We ran all analyses using the WRS2 (Mair & Wilcox, 2018) and boot (Canty & Ripley, 2021) R packages.
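
As a minimal sketch of those exploratory CIs (with a hypothetical data frame and column names), a BCa bootstrap interval for the correlation between a continuous moderator and a motivation construct could be computed with the boot package:

  # Illustrative sketch only: motivation_long and its columns are hypothetical.
  library(boot)

  # Statistic: correlation between performance and external regulation within
  # one condition's subgroup (resampled rows indexed by i).
  cor_stat <- function(d, i) cor(d$performance[i], d$external_regulation[i])

  osfa <- subset(motivation_long, condition == "OSFA")
  boot_out <- boot(data = osfa, statistic = cor_stat, R = 2000)
  boot.ci(boot_out, conf = 0.90, type = "bca")  # 90% CI, as in the exploratory analyses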

Our qualitative analysis concerns the open-text comments. Because commenting was optional, we expect such feedback to reveal the most important perceptions from the students’ perspective. The analysis process involved four steps and five researchers. First, one author conducted a thematic analysis (Braun & Clarke, 2006), familiarizing himself with the data, generating and reviewing codes, and grouping them into themes. Acknowledging the subjective nature of participants’ comments, he followed the interpretivist semi-structured strategy (Blandford et al., 2016). Accordingly, he applied inductive coding due to participants’ freedom to mention varied aspects. Next, a second author reviewed the codebook. Third, three other authors independently tagged each comment through deductive coding, using the codebook developed and reviewed in the previous steps. In line with the interpretivist approach, the goal of having multiple coders was to increase reliability through complementary interpretations, although it is important that others inspect such interpretations (Blandford et al., 2016). Therefore, in the last step, the author who conducted the first step reviewed step three’s results. Here, he aimed for a wider, complementary interpretation of participants’ comments, rather than seeking a single definitive tag for each comment. That step led to the consolidated, final report we present in the “Results” section.

Results

This section analyzes the comparability of the experimental conditions, in terms of participants’ information, and presents data analyses’ results.

Preliminary Analyses - Do groups differ?

These analyses compare the conditions’ participants to identify possible covariates, using robust ANOVAs (see “Data Analysis” section) and chi-squared tests of independence to compare continuous and categorical variables, respectively. When counts are lower than 5, we simulate p-values using bootstrap through R’s chisq.test function for increased reliability. In those cases, degrees of freedom (df) are not applicable (NA) due to the bootstrap.
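
A minimal sketch of such a comparison (with hypothetical variable names) follows; the simulate.p.value argument of chisq.test resamples the contingency table when expected counts are low, which is why df is reported as NA:

  # Illustrative sketch only: sample_data and its columns are hypothetical.
  tab <- table(sample_data$condition, sample_data$preferred_genre)
  chisq.test(tab, simulate.p.value = TRUE, B = 2000)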

We found nonsignificant differences for demographics, gaming preferences/habits, and experience researching gamification. For simplicity, those results can be seen in our Supplementary Material. For performance, design’s main effect was nonsignificant (F(1, 24.8985) = 0.9042; p = 0.35), but session’s main effect (F(1, 29.8645); p = 0.0112), as well as the factors’ interaction (F(1, 29.8645); p = 0.0041), were statistically significant. Accordingly, we ran post hoc comparisons of OSFA versus Personalized for sessions 0 (t = 0.2387; p = 0.78464) and 1 (t = -1.9778; p = 0.04341) using Yuen’s test, a robust alternative for comparing two independent samples (Wilcox, 2011). The results provide evidence that participants of the OSFA condition made fewer mistakes per assessment item (M = 0.964; SD = 0.509) than those of the Personalized condition (M = 1.19; SD = 0.459). Thus, the preliminary analyses indicate a single statistically significant difference among conditions - session 1’s performance - when analyzing possible covariates, although descriptive statistics show uneven distributions for some demographics (see Table 2).
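
For reference, a post hoc comparison like the ones above could be run with the yuen function of the WRS2 package; the sketch below assumes a hypothetical data frame holding each participant’s session 1 errors per item:

  # Illustrative sketch only: performance_s1 and its columns are hypothetical.
  library(WRS2)

  # Yuen's test on 20% trimmed means for two independent groups.
  yuen(errors_per_item ~ condition, data = performance_s1, tr = 0.2)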

Quantitative Analysis of H1

Table 4 presents descriptive statistics for all motivation constructs. Constructs’ reliability, measured by Cronbach’s alpha, was acceptable (≥ 0.7) for all but external regulation (0.59), which was questionable (Gliem & Gliem, 2003). Additionally, Table 5 shows the results from testing H1. As all p-values are larger than the 0.05 alpha level, there is no statistically significant difference for any motivation construct. Thus, our findings partially support H1: the expected significant differences in intrinsic motivation and identified regulation were not found, whereas our data support the nonsignificant differences in external regulation and amotivation.

Table 4 Descriptive statistics, overall (Ovr) and per session (S0 and S1)
Table 5 Confirmatory analyses of H1: personalization affects autonomous motivation (intrinsic and identified) but not external regulation and amotivation

Exploratory Analyses (RQ1 and RQ2)

This section presents results for RQ1 and RQ2. As those are based on analyses of subgroups, each one’s number of participants is available in Table 2. Note, however, that all participants engaged in two sessions. Therefore, the number of data points in each subgroup is twice that in Table 2.

RQ1: Moderators of Personalization’s Effect

For continuous variables, moderations are indicated when CIs from the OSFA condition do not overlap with those of the Personalized condition. Accordingly, Table 6 indicates performance was the only continuous moderator. The results indicate that higher performance was associated with higher external regulation for OSFA users [0.4;0.7] but not for Personalized users [-0.24;0.16], and that those correlations’ CIs did not overlap. Hence, these results suggest performance moderated personalization’s effect on external regulation.

For categorical variables, moderations are indicated when CIs suggest a difference for one of a variable’s subgroups but not for the others (compare columns in Table 7 for an overview). This is the case of gender. Females’ CIs do not overlap when comparing the amotivation of the personalized [1.31;1.78] and the OSFA [1.95;2.70] conditions. Differently, males’ CIs overlap when comparing the personalized [2.34;3.16] and the OSFA [1.74;2.39] conditions in terms of amotivation. Thus, the results suggest gender moderated personalization’s effect, which was only positive for females. Education appears to be another moderator: students with a technical degree who used the personalized design experienced higher intrinsic motivation than those who used the OSFA design. Preferred game genre also appears to be a moderator. Among participants who prefer adventure games, those of the OSFA condition reported better identified regulation and amotivation (i.e., higher identified regulation and lower amotivation) than those of the personalized condition. Preferred playing setting seems to be another moderator: among those who prefer singleplayer, participants of the OSFA condition reported higher intrinsic motivation and identified regulation than those of the personalized condition. Additionally, the results suggested no differences among the subgroups not mentioned. Overall, our findings indicate the assessment’s subject and usage interval did not moderate personalization’s effect on any construct, in contrast to gender, education, preferred game genre and playing setting, performance, and age (RQ1).

Table 6 Exploratory Analyses for continuous variables
Table 7 Exploratory analyses based on 90% Confidence Intervals (CIs) calculated through bootstrap

RQ2: Motivation Variation Among Conditions

Based on Tables 6 and 7 (comparing rows), student motivation varied according to six characteristics for users of the OSFA design. First, performance was positively correlated with external regulation and amotivation. Second, people whose preferred game genre is adventure reported higher intrinsic motivation than those who prefer action and RPG games, as well as higher identified regulation and lower amotivation than those who prefer any other genre analyzed. Third, participants whose preferred playing setting is singleplayer reported higher intrinsic motivation and identified regulation than those who prefer multiplayer. Fourth, education: those with a technical degree reported higher identified regulation than those with a high school degree. Fifth, assessment’s subject: identified regulation was higher for LP2 students than for students of all other subjects, and external regulation was higher for POO students than for IPC students. Sixth, external regulation was lower when the usage interval was six or more weeks than when it was up to four weeks.

Differently, motivation varied according to four characteristics for users of the Personalized design. First, age was negatively correlated with identified regulation. Second, amotivation differed depending on gender. Third, education: students with a technical degree reported higher intrinsic motivation and identified regulation compared to those with other degrees. Fourth, assessment’s subject: identified regulation was higher for LP2 students than for PCA students, and the amotivation of POO students was higher than that of IPC and PCA students. These results suggest motivation from personalized gamification varied according to fewer factors than that from the OSFA design (RQ2).

Qualitative Analysis of RQ3

Thirty-two of the 58 participants provided 52 comments (participants could comment on each session). The thematic analysis found seven codes that were grouped into two themes. In step three, researchers attributed 114 codes to the 52 comments. Lastly, the consolidation step updated the codes of 13 comments, leading to the final average of 2.19 codes per comment. Table 8 describes codes and themes, exemplifying them with quotes.

Table 8 Themes and codes attributed to participants’ comments after conducting and validating a thematic analysis

Summary of Results

  • Preliminary analyses showed participants of the personalized condition experienced more difficulty in the second session’s assessment.

  • H1 is partially supported. Surprisingly, the results do not confirm personalization’s positive effect on autonomous motivation - instead, they indicate a nonsignificant difference - while they corroborate the nonsignificant effect on external regulation and amotivation.

  • RQ1: Exploratory analyses suggested gender and education positively moderated personalization’s effect, in contrast to preferred game genre and preferred playing setting. Personalization was positive for females and those holding a technical degree, but negative for people who prefer either the adventure game genre or the singleplayer playing setting.

  • RQ2: Exploratory analyses revealed motivation varied according to six characteristics for students who used the OSFA design: performance, preferred game genre, preferred playing setting, education, assessment’s subject, and usage interval. The analyses indicated the motivation of students who used personalized gamification varied according to only four factors: education, assessment’s subject (common to OSFA), age and gender (uncommon).

  • RQ3: Qualitative results indicated the gamified assessments provided positive experiences that students perceived as well designed and good for their learning, although a few of them mentioned gamification demands improvement and considered the assessments complex and badly presented.

Importantly, motivation is multidimensional and involves a number of constructs (see “Motivational Dimensions” section). Accordingly, changing one’s feelings in terms of any of those constructs will inevitably affect that person’s motivation. Based on that, note that we are not claiming, for instance, that gender moderated personalization’s effect on all motivation constructs. Instead, we are referring to our empirical finding that gender moderated personalization’s effect on some motivation construct (amotivation, in that case), which thus implies an effect on general motivation as well.

Discussion

For each hypothesis/RQ we studied, this section interprets its results, relates them to the literature, and discusses possible explanations, which we present as testable hypotheses (TH). We aim for those to be understood as hypotheses derived from discussing and interpreting our findings, not as conclusions drawn from or supported by our data, because the TH emerged from exploratory analyses (Abt, 1987). Therefore, aiming to increase our contribution, we provide the TH to inform future research, which must empirically test them.

How Does Personalization Affect Student Motivation?

Confirmatory analyses (H1) revealed no significant differences among conditions for any motivation construct. However, participants of the personalized condition had lower performance than those of the OSFA condition in the second assessment. Considering research shows performance might affect gamification’s effect (Rodrigues et al., 2021; Sanchez et al., 2020), participants of the personalized condition would be expected to report lower motivation than those of the OSFA condition in that session. In contrast, our results show personalized gamification provided motivation levels comparable to those of the OSFA approach even though its participants experienced higher difficulty during the second session’s task. On the one hand, one might suspect that personalized gamification contributed to student motivation by making it less sensitive to their performance, rather than by increasing it. Based on research about seductive details (Rey, 2012), gamification might distract students with low knowledge and, consequently, affect their motivations negatively. Therefore, our suspicion is that personalization might have addressed those distractions by offering game elements suitable to the student and the task. Another possibility is that OSFA gamification’s benefits for students with low initial knowledge decrease over time (Rodrigues et al., 2021). Then, personalization might have addressed that time effect for low-performance students. Thus, research efforts should test whether personalization works by preventing distractions and avoiding time effects for low-performance students:

  • TH1: Personalization makes user motivation less sensitive to user performance.

On the other hand, one might suspect that personalization increased student motivation after a first-time experience (i.e., at session 1), but participants’ lower performance decreased it, which prevented any differences from appearing. That suspicion builds upon the lack of longitudinal studies evaluating personalization’s effect. For instance, most related work used cross-sectional studies (e.g., Hajarian et al., 2019; Lopez & Tucker, 2021; Mora et al., 2018). Only Rodrigues et al. (2021) used a repeated-measures design, which was limited to two measurements with a one-day spacing. Whereas our study also captured two measurements, the spacing between them varied from four to six weeks. Performance differences, however, limited our findings’ contribution to understanding how personalization’s effect changes over time. Thus, while personalization might mitigate the novelty effect’s impact on gamification, regardless of students’ knowledge level, empirical research is needed to test TH2:

  • TH2: Personalization increases user motivation after a first-time experience.

How Do Students Perceive Gamified Review Assessments?

The qualitative analysis of open-text comments (RQ3) showed that students considered the activity good for their learning process and well designed, as it approached a perspective rarely explored in their studies: taking time to review theoretical/conceptual aspects of computing education. While empirical evidence shows the testing effect improves learning in gamified settings (Rodrigues et al., 2021; Sanchez et al., 2020), studies have not inspected users’ perceptions of such learning activities. This is important because those and other educational tasks are often not motivating (Hanus & Fox, 2015; Palomino et al., 2020; Pintrich, 2003). Thereby, we expand the literature with results that encourage the use of gamified review assessments to leverage the testing effect while providing overall positive experiences to students.

Furthermore, the results corroborate gamification literature by showing it is mostly, but not always, perceived positively (Sailer & Homner, 2020; Toda et al., 2018). Prior research demonstrating that different people are motivated by different game elements (e.g., Bovermann & Bastiaens, 2020; Tondello et al., 2017) corroborates those results. Consequently, this suggests the need to improve the personalization strategy applied. A possible reason is that we limited gamification to feature three game elements, and the literature suggests that the number of game elements predetermines gamification’s effectiveness (Landers et al., 2015). Another possible explanation is that our personalization mechanism changed which game elements were available, whereas some comments suggested changing how game elements work. A third perspective is that the recommendations on how to personalize (i.e., which game elements to use when) demand refinement to model users/tasks better. While we further inspected the latter perspective through RQ1 and RQ2, we expect future research to test whether:

  • TH3: Designs with more than three game elements improve users’ perceptions about gamification.

  • TH4: Successful personalization of gamification requires tailoring the mechanics of game elements as well as which of them should be available.

Which Factors Moderate Personalization’s Effect?

Exploratory analyses (RQ1) indicated four moderators of personalized gamification’s effect: gender, education, preferred game genre, and preferred playing setting. Because we considered those factors in defining personalized designs, we expected no moderator effect from them. However, such a contrast is somewhat expected. Research on OSFA gamification shows several factors (e.g., gender) moderate its effectiveness (see “Gamified Learning: Effects and Moderators” section), even though scholars have striven to develop methods for designing it over the last decade (Mora et al., 2015). Accordingly, one should expect the need to update such models as they are empirically tested, as with every theory (Landers et al., 2018). Therefore, we discuss two research lines to explain moderators of personalization’s effect.

First, consider how game elements are recommended depending on those moderators’ levels. For instance, results indicate personalization mitigated females’ amotivation but did not work for males. The strategy we used (Rodrigues et al., 2021) does not consider gender when selecting game elements within contexts such as the one of this study, that is, where the country is Brazil and the learning activity type is remembering. The same happens for education and preferred playing setting. Differently, preferred game genre is one of the most influential factors. However, the strategy simplifies that factor to whether or not one prefers the action genre (see the illustrative sketch after TH5). Thereby, future research should test TH5:

  • TH5: Further modeling user demographics and gaming-related preferences will improve personalization’s effect.
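To make the concern behind TH5 concrete, below is a minimal, purely hypothetical sketch of a rule-based (decision tree-like), preference-based recommender. The attribute names, branching rules, and game element sets are our own illustrative assumptions, not the actual model from Rodrigues et al. (2021); the point is only to show how coarse modeling (e.g., ignoring gender and education in a given context, or binarizing game genre) can leave moderators of personalization’s effect unaddressed.

```python
# Hypothetical illustration only: these attributes, thresholds, and game element
# sets are NOT the rules from Rodrigues et al. (2021); they merely show the shape
# of a rule-based, preference-based recommender mapping user/task attributes to
# a set of game elements.
from dataclasses import dataclass


@dataclass
class UserContext:
    country: str                 # e.g., "Brazil"
    task_type: str               # e.g., "remembering" (Bloom's taxonomy level)
    gender: str                  # unused in this branch (cf. TH5/TH8)
    education: str               # unused in this branch (cf. TH5/TH8)
    prefers_action_genre: bool   # genre collapsed to a binary flag (cf. TH5)


def recommend_game_elements(user: UserContext) -> set[str]:
    """Return a set of game elements for the given user/task context."""
    if user.country == "Brazil" and user.task_type == "remembering":
        # In this branch, gender and education are ignored; only the
        # binarized genre preference changes the recommendation.
        if user.prefers_action_genre:
            return {"points", "leaderboard", "time pressure"}
        return {"points", "badges", "progress bar"}
    # Fallback: a generic PBL-like design for unmodeled contexts.
    return {"points", "badges", "leaderboard"}


# Two students who differ only in gender receive the same design, which is one
# way a gender moderator effect could go unaddressed by the strategy.
student_a = UserContext("Brazil", "remembering", "female", "undergraduate", False)
student_b = UserContext("Brazil", "remembering", "male", "undergraduate", False)
assert recommend_game_elements(student_a) == recommend_game_elements(student_b)
```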

Another possible reason is that we used a preference-based personalization strategy. Although that approach has been widely researched (Klock et al., 2020; Rodrigues et al., 2020), the literature argues that user preference often fails to reflect user behavior (Norman, 2004). To face that issue, researchers are investigating data-driven personalization strategies (e.g., Hajarian et al., 2019), that is, inspecting user behavior to determine the most suitable game elements (Tondello et al., 2017). Such strategies assume that, by relying on interaction data, they will reliably identify users’ preferences and, consequently, improve gamification’s effectiveness. Thus, given the limitations of preference-based strategies, future research should test TH6:

  • TH6: Data-driven personalization strategies are more effective than preference-based ones.

How Does User Motivation Variation Change When Comparing OSFA and Personalized Gamification?

Exploratory analyses (RQ2) revealed that motivation from using OSFA gamification varied according to six characteristics: performance, preferred game genre, preferred playing setting, education, assessment’s subject, and usage interval. Those variations are expected considering prior research has shown substantial heterogeneity in gamification’s outcomes in terms of user-to-user variation (Rodrigues et al., 2020; Van Roy & Zaman, 2018) and other characteristics (Bai et al., 2020; Huang et al., 2020; Sailer & Homner, 2020). Differently, motivation from users of the personalized condition did not vary due to performance, preferred game genre, preferred playing setting, and usage interval, but similarly varied according to assessments’ subject and education. Due to personalization, one might expect reduced variation in outcomes if it leads to gamification designs suitable to every user’s preferences/motivations (Hallifax et al., 2019; Klock et al., 2020; Rodrigues et al., 2020). Hence, we suspect that providing personalized game elements made motivation less sensitive to performance and reuse issues (e.g., losing effectiveness after the novelty has vanished). Thus, raising the need for testing whether:

  • TH7: Personalization mitigates OSFA outcomes’ sensitivity to user and contextual factors.

Nevertheless, personalization did not tackle variations from education and assessments’ subjects, besides varying due to gender and age. Assessments’ subject and age are not considered by the personalization strategy we used, which might explain their role. Additionally, while the strategy considers gender and education, it assumes those factors do not play a role in game element selection for Brazilians doing remembering learning activities (Rodrigues et al., 2021). Hence, the rationale for such findings might be that gender and education were not completely modeled by the personalization strategy in Rodrigues et al. (2021). Thus, research should test whether:

  • TH8: Gender and education require further modeling to properly suggest the most suitable game elements.

Implications

In light of our discussion, this section connects our findings to their implications for design and theory.

Design Contributions

First, our findings inform the design of gamified educational systems on how personalization contributes to gamification. Results from RQ2 suggested personalization tackled motivation variations compared to the OSFA approach. Hence, designers might not see direct effects, such as increased motivation, but personalization might be acting by minimizing the extent to which one user group benefits more/less than others. However, findings from RQ1 suggest there were moderators of personalization’s effect, warning that caution is needed when applying the personalization strategy we did because it might offset its contribution in some cases (e.g., improving amotivation but decreasing external motivation of females who prefer Adventure games). Thus, designers might use multidimensional personalization to offer more even experiences to their systems’ users while paying attention to possible moderators of its effect.

Second, our findings provide considerations on how to design personalized gamification. Results from RQ3 question whether using only three game elements and personalizing by changing the game elements available are the best choices for deploying and personalizing gamified designs. Therefore, we contribute considerations that gamified designs with more than three game elements might improve users’ perceptions about gamification (TH3) and that successful personalization of gamification requires tailoring the game elements’ mechanics as well as their availability (TH4).

Third, our results inform instructors on the design of learning assessments. Results from RQ3 also revealed that students had positive experiences while completing the gamified review assessments and that completing the assessments contributed to their learning. Such a finding is important because educational activities are often not motivating for students, and low motivation harms learning performance (Hanus & Fox, 2015; Palomino et al., 2020; Rodrigues et al., 2021). Thus, we inform instructors on how they might successfully use such learning activities in practice, considering that gamified review assessments are valued and positively perceived by students.

Theoretical Contributions

Our first theoretical contribution relates to how personalization contributes to gamification. By triangulating findings from H1 and RQ2, it seems personalization improved gamification by offering more even experiences for the different user groups, instead of increasing the outcome’s average. Thus, we pose to researchers the question of what the exact mechanism is through which personalization contributes to gamification. In exploring answers to that question, results from RQ2 led to considerations suggesting that personalization mitigates the sensitivity of OSFA gamification’s outcomes to user and contextual factors (TH1 and TH7). Additionally, triangulating findings from H1 and the preliminary analyses suggested personalization increases user motivation after a first-time experience (TH2) when samples’ characteristics are comparable.

Second, our results inform researchers on determinants of personalized gamification’s success. Results from RQ1 suggested when/to whom personalization was more or less effective. Consequently, we provide theoretical considerations that the personalization strategy from Rodrigues et al. (2021) might benefit from considering other information and from further modeling some characteristics it already considers (i.e., TH5 and TH8).

Third, our analyses revealed a theoretical consideration on how to develop personalization strategies. In discussing RQ1, we compared preference-based and data-driven personalization strategies. Note that they are complementary: the former allows personalizing systems from the user’s first use, while the latter relies on actual usage data (Tondello, 2019). Nevertheless, the limitations of user preference and the evidence supporting data-driven personalization’s effectiveness (e.g., Hajarian et al., 2019) led to the theoretical consideration that data-driven personalization strategies are more effective than preference-based ones (TH6).

Lastly, we share our data and materials. That complies with open science guidelines and literature recommendations toward mitigating the replication crisis in empirical computer science (Cockburn et al., 2020; Vornhagen et al., 2020). Additionally, we personalized gamification based on a freely available recommender system (Rodrigues et al., 2021), besides using a gamified educational system (Eagle-Edu) that is education/research friendly. Thus, extending our contribution and facilitating replications.

Limitations and Future Work

This section discusses study limitations and presents future research directions accordingly. First, our sample size (n = 58) is below, though not far from, related works’ median. That might be partly attributed to attrition, which is common in longitudinal designs and caused a loss of 38% of our eligible participants. Because this study was conducted during the COVID-19 pandemic, instructors mentioned they witnessed unprecedented drop-out rates, which might explain the attrition rate. While that size affects findings’ generalization, we believe that having conducted a multi-site study in ecological settings leads to a positive trade-off. Although sample size also affects our confirmatory analyses’ validity, as it implies low statistical power, we sought to mitigate that issue by only conducting planned comparisons.

Note that we planned a one-factor experiment to compare OSFA and personalization. Because personalization is built upon the idea of having different people use different designs depending on, for instance, their characteristics, it is expected to have a distinct number of participants in each gamification design. Hence, we believe this aspect does not hinder our study’s validity. Nevertheless, exploratory analyses’ results are more prone to sample size limitations because they rely on subgroups. This is the reason our exploratory analyses were limited to one-factor comparisons (e.g., males versus females), instead of multi-factor ones (e.g., females who prefer single-playing versus females who prefer multi-playing, then the same for males), and were explicitly presented as findings to be tested (see “Method” section). To further cope with this limitation, we used 90% CIs measured through bootstrap, aiming to increase results’ reliability while avoiding misleading conclusions that could emerge from reporting and interpreting p-values (Cairns, 2019; Carpenter & Bithell, 2000; Vornhagen et al., 2020).
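For readers unfamiliar with the procedure, the following is a minimal sketch of a percentile bootstrap 90% CI for a one-factor subgroup comparison. It assumes nothing about our actual analysis scripts; the data, variable names, and the choice of a mean difference as the comparison statistic are illustrative only.

```python
# Minimal sketch: percentile bootstrap 90% CI for the difference in mean
# motivation between two independent subgroups (e.g., OSFA vs. personalized
# within one moderator level). Data and names are illustrative, not the study's.
import numpy as np

rng = np.random.default_rng(42)


def bootstrap_ci_mean_diff(group_a, group_b, n_boot=10_000, ci=0.90):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, preserving group sizes.
        diffs[i] = (rng.choice(a, a.size, replace=True).mean()
                    - rng.choice(b, b.size, replace=True).mean())
    alpha = (1 - ci) / 2
    lower, upper = np.quantile(diffs, [alpha, 1 - alpha])
    return lower, upper


# Illustrative data: per-participant motivation scores (e.g., Likert means).
osfa = [4.1, 5.0, 3.8, 4.6, 5.2, 4.0, 4.9]
personalized = [4.8, 5.1, 4.4, 5.3, 4.9, 5.0]
print(bootstrap_ci_mean_diff(osfa, personalized))  # a CI excluding 0 suggests a difference
```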

Second, our study is limited to a single dimension of Bloom’s taxonomy. That was a design choice aimed at increasing internal validity, whereas a multi-factor experimental study (e.g., comparing multiple knowledge dimensions) would add numerous confounders and require a larger sample. While this limits the generalizability of our findings, we believe that choice leads to a positive trade-off in allowing us to approach a specific research problem with greater validity. Hence, given research discussing that the learning task affects gamification’s success (Hallifax et al., 2019; Rodrigues et al., 2019), only further empirical evidence can answer whether our findings will hold for other dimensions of Bloom’s taxonomy. Thus, especially considering our findings contrast with those of similar research (Rodrigues et al., 2021), those sample size and research context limitations, along with our testable hypotheses, provide directions for future replication studies that must validate and test our results’ generalization.

Third, students of the same class used a new system wherein the gamification design varied from one student to another. We informed participants they would use different gamification designs, but not that PBL was the control group. Therefore, we believe contamination did not substantially affect our results. Moreover, all participants had never used Eagle-Edu before, and they only used it twice during the study. Consequently, there is no evidence on how participants’ motivations would change when completing gamified review assessments more often and over multiple terms. Based on those points, we recommend future studies analyze how personalized and OSFA gamification compare, in the context of review assessments, when used for longer periods of time.

Lastly, there are four limitations related to our study’s instruments/apparatus. Concerning our measure, the external regulation construct showed questionable reliability, similar to prior studies (Gamboa et al., 2013; Guay et al., 2000; Rodrigues et al., 2021). Aiming to mitigate that limitation, we controlled for between-participants variation in our quantitative analyses, although some argue such an issue might not be pertinent to HCI research (Cairns, 2019). Additionally, students might have completed the instrument based on different experiences than those they had with the gamified system. While we carefully instructed them on how to complete SIMS, we cannot ensure this due to human biases and subjectivity (Blandford et al., 2016; Wohlin et al., 2012). Concerning Eagle-Edu, we needed to create one course shell for each gamification design. Consequently, some of those had few students. That technical issue affected designs featuring leaderboards, leading to cases wherein students had few peers to compete against. Concerning the personalization strategy, it was developed based on user preference, which is often criticized compared to data-driven approaches (Norman, 2004), and was only initially validated against OSFA gamification (Rodrigues et al., 2021). In summary, those limitations suggest the need for i) further inspecting SIMS’ validity in the context of learning assessments, ii) explicitly studying how small competition pools affect students’ motivations, and iii) extending the external validity of the personalization strategy introduced in Rodrigues et al. (2021).

Conclusion

VLE play a crucial role in enabling assessment for learning. That approach has strong support for its positive effect on learning gains, but it is often not motivating for students. While standard (OSFA) gamification can improve motivation, variations in its effects inspired research on personalized gamification. However, there is little knowledge on how personalization contributes beyond OSFA gamification. Therefore, we conducted a multi-site experimental study wherein students completed gamified assessments with either personalized or OSFA gamification. Our results suggest a new way of seeing personalization’s role in gamification and inform designers, instructors, and researchers:

  • We show that, whereas personalization might not increase the outcome’s average, it likely improves gamification by reducing the outcome’s variation;

  • We show that gamified review assessments provide positive experiences that students consider good means for their learning;

  • Our discussion provides design and research directions toward advancing the field;

Our results inform i) designers interested in personalized gamification, showing what benefits to expect from it; ii) instructors using interactive systems to deploy assessments for learning, on the value of gamifying them; and iii) personalized gamification researchers, with guidance on how to advance the field. Also, we extend our contribution by sharing our data and materials.