
Playing with Scales: Creating a Measurement Scale to

Assess the Experience of Video Games

Mark James Parnell

Project report submitted in part fulfilment of the requirements for the degree of
Master of Science (Human-Computer Interaction with Ergonomics) in the
Faculty of Life Sciences, University College London, 2009

NOTE BY THE UNIVERSITY

This project report is submitted as an examination paper. No responsibility can


be held by London University for the accuracy or completeness of the material
therein.
Acknowledgments

My heartfelt thanks to everyone that has helped me over the course of this

project – from friends to family to participants, all of your help has been much

appreciated. Many thanks to my supervisors, Dr. Nadia Berthouze and Dr. Duncan

Brumby, for their kind guidance and patience, as well as to Dr. Eduardo Calvillo

Gámez for his support.

Thanks also to Dr. Wijnand IJsselsteijn and Karolien Poels of Eindhoven

University for the use of the GEQ questionnaire in the review, as well as to Laura

Ermi and Dr. Frans Mäyrä for supplying a copy of their Immersion questionnaire for

review.

Finally, special thanks to David Tisserand at SCEE for all of the insight,

assistance and backing that he has provided me throughout this project. The project

couldn’t have happened without it.

- MJP

Abstract

A video game should be appealing to play. It should be usable, playable and

provide enjoyable experiences. One tool for assessing the appeal of a game is to have

gamers complete a questionnaire (or scale) after they have played the game. Of the

current battery of scales, none provides an integrated measure of a

game’s appeal. To address this gap a Gameplay Scale is presented that assesses

gamers’ attitudes towards a game’s appeal and quality. The Gameplay Scale is

validated across two studies. Study 1 had gamers (n = 98) respond to a web survey

after playing the downloadable game PixelJunk Eden for 2 hours. Cluster analysis of

responses found that the Gameplay Scale contained distinct subscales measuring

different gameplay constructs: (1) Affective Experience, (2) Focus, (3) Playability

Barriers, and (4) Usability Barriers. Overall, the Gameplay Scale accounted for 73%

of the variance in a game’s initial appeal. Study 2 validated the Gameplay Scale by

showing how it generalizes to different genres of games (i.e. open-world) and is able

to predict a game’s appeal and quality (i.e. by review score) after a relatively short

period of game play (1 hour). These findings suggest that the Gameplay Scale can

predict the appeal and quality of a game. This information may be of value to game

developers who wish to evaluate a game’s likely appeal during the development

process.

Contents

1 Introduction..................................................................................................... 1

2 Literature Review............................................................................................ 3

2.0.1 Overview ............................................................................................. 3

2.0.2 Definitions of Terms Used ................................................................... 3

2.1 Factors Involved in the Player Experience .................................................. 6

2.1.1 Player Experience and Engagement ..................................................... 6

2.1.2 Flow, Cognitive Absorption and Challenge.......................................... 7

2.1.3 Presence, Immersion and Fun ............................................................ 11

2.2 Usability and Playability in Games ........................................................... 15

2.3 Measuring the Video Game Experience .................................................... 19

2.3.1 Evaluating Selected Constructs .......................................................... 19

2.3.2 Scale Design and Validation .............................................................. 22


Order Effects .................................................................................... 23
Question Wording............................................................................. 23
Response Item Design....................................................................... 24
Web Surveys..................................................................................... 26

2.3.3 Previous Gameplay Scales ................................................................. 26

3 Study #1: Scale Construction, Validation and Refinement.......................... 30

3.1 Rationale .................................................................................................. 30

3.2 Questionnaire Construction....................................................................... 30

3.2.1 The Gameplay Scale .......................................................................... 30

3.2.2 The Appeal Scale............................................................................... 32

3.3 Methods ................................................................................................... 32

3.3.1 Participants ........................................................................................ 32

3.3.2 Materials............................................................................................ 33

3.3.3 Procedure........................................................................................... 33

3.4 Results and Analysis................................................................................. 34

3.4.1 The Gameplay Scale .......................................................................... 34

3.4.2 The Appeal Scale............................................................................... 37

3.4.3 Inter-Scale Correlations ..................................................................... 37

3.5 Interim Discussion.................................................................................... 39

4 Study #2: Further Exploration of Scale Validity ......................................... 44

4.1 Rationale .................................................................................................. 44

4.2 Methods ................................................................................................... 46

4.2.2 Participants ........................................................................................ 46

4.2.3 Materials............................................................................................ 47

4.2.1 Design ............................................................................................... 48

4.2.4 Procedure........................................................................................... 48

4.3 Results...................................................................................................... 49

4.3.1 Significance Testing Between Groups................................................ 49

4.3.2 Inter-Scale Correlations ..................................................................... 51

4.3.3 Comparison with Study 1 Data........................................................... 53

4.4 Interim Discussion.................................................................................... 54

5 General Discussion ........................................................................................ 59

6 Conclusions.................................................................................................... 64

References......................................................................................................... 66

Appendix A - The Initial Gameplay Scale....................................................... 75

Appendix B - Results of Cluster Analysis on the Initial Gameplay Scale ...... 78

Appendix C - The Revised Gameplay Scale .................................................... 80

Appendix D - The Appeal Scale ....................................................................... 81

Appendix E - Information Sheet and Consent Form Used in Study #2 ......... 82

List of Illustrative Figures

Figure 2.1. Graph Showing Flow Channel…………………………………………. 8

Figure 2.2. Graph Showing How Differing Balances of Challenge and Skills
Result in Different Affective Experiences…………………………………………... 9

Figure 2.3. Screenshot of Sega’s Rez……………………………………………….. 14

Figure 2.4. Example Likert-Type Response Item…………………………………... 21

Figure 2.5. Example Likert-Type Response Item…………………………………... 25

Figure 4.1. Screenshots of Sega’s Hulk and Radical Games’s Prototype………….. 44

Figure 4.2. Sony’s Dualshock 3 Controller………………………………………… 47

Figure 4.3. Mean Summed Gameplay Scale Score and SD for each Game.……….. 50

Figure 4.4. Correlation Between Gameplay Scale and Appeal


Scale………………………………………………………………………………… 53

Figure 4.5. Mean Gameplay Scale Scores for each Game…………………………. 54

List of Tables

Table 2.1. Gameplay Heuristics Found Across the Literature……………………... 18

Table 2.2. Four Main Factors of Video Game Experience…………………………. 20

Table 3.1. Four Main Factors of Video Game Experience…….................................. 30

Table 3.3. Revised Gameplay Scale Items and Correlations to Scale and Subscale
Scores………………………………………………………………………………... 35

Table 3.5. Spearman’s Correlations Between Subscales, the Gameplay Scale and
the Appeal Scale for Study 1……………………………………………………………….. 37

Table 4.1. For Each Game, the Mean Item Score for Each Subscale and the
Averaged Summed Scale Scores, and Standard Deviation……………..................... 49

Table 4.2. Spearman’s Correlations between Subscales, the Gameplay Scale and
Appeal Scale for Study 2……………………………………………………………. 52

1 Introduction

“Play is older than culture, for culture, however inadequately

defined, always presupposes human society, and animals have not waited

for man to teach them their playing” (Huizinga, 1938/1998; p. 1.)

Academic video games researchers (both within and outside of the HCI

community) now research play experiences extensively, yet few of their findings have

any impact upon the video games industry (Hopson, 10 November 2006). A great

challenge for researchers is to support industry practice, and one way to do this is to

develop tools to improve the user experience of games. As with all software, the

developers of video games are not the same as their users. Despite being gamers

themselves, their attitudes towards their creations will inevitably differ from those of

their audience. The result is that video games often have issues where the end user

struggles to operate or understand the game. In productivity software development,

the remedy for this has been to use usability principles and testing methods to detect

and eliminate any such usability problems. However, such techniques have, until

recently, been slow to catch on in video games development (Fulton, 2002), yet they

are perhaps even more important here. A user who has to struggle with a poorly-

designed word processor at their office may grumble (and have reduced efficiency)

but in the end has to use the word processor. This is not the case with video games –

playing video games is a choice, and the player can always put the controller down if

the game is too hard, or clunky, or simply isn’t any fun (Laitinen, 23 June 2005). This

means that removing any barriers to play – and to fun – is of the utmost importance if

the video game is to be as appealing as it can be.

Methodologies to test the usability and user experience of video games (often

called ‘player testing’ methods) do exist (e.g. Pagulayan et al, 2003; Kim et al, 2008)

and have no doubt improved the play experience of many games. One important part

of these methods involves the use of questionnaires to measure player attitudes after

play, especially since ‘think-aloud’ protocols during the game can distort the player

experience significantly (Pagulayan et al, 2003). No questionnaire exists in the

literature that measures all aspects of play experiences (what Hassenzahl et al (2000)

called the ‘ergonomic’ and ‘hedonic’ factors that make up user experience), so there is a

real need to measure all elements of the experience. Moreover, Hornbaek (2006)

called the measurement of user experience via questionnaire ‘in disarray’ with little

utilisation of existing research or methods; there is also a need to create a well-

designed measurement tool. This thesis aims to determine how best to measure these

‘hedonic’ and ‘ergonomic’ factors using questionnaires, by developing a new

questionnaire that is both valid and reliable. To prove that this is useful to the

industry, whether review scores can be predicted by the scale will also be determined.

The next chapter reviews the literature, first to identify what elements the scale

must measure, before examining best practice in questionnaire design. Previous

similar questionnaires are then reviewed to determine what they got right (and

wrong). The third chapter involves the first study, in which the questionnaire is

initially developed and validated, whilst chapter four involves the further validation of

that questionnaire in experimental conditions. Chapter five will discuss the successes,

failures and implications of the study, whilst chapter six serves as a recapitulation and

conclusion of the research.

2 Literature Review

2.0.1 Overview

There are numerous ways to measure the usability of productivity software

that can be translated to video games; all have their strengths and weaknesses, but one

that can be particularly useful in the context of user testing is the questionnaire. As

will be argued, questionnaires play an important role in player testing yet there is no

existing standardised questionnaire that measures all of the factors contributing to a

game’s appeal that need to be considered. Additionally, many existing questionnaires

that do measure some of the factors are flawed. The first section of this review will

define the key factors to be measured; these are usability, playability and player

experience. The different sorts of player experience will then be examined, as will the

application of the terms ‘playability’ and ‘usability’ to video games. Once we have

considered what factors any novel questionnaire will need to include, best practice in

questionnaire design will be examined, and these principles used to critique existing

questionnaires. Such a review should provide principles with which a new player

testing questionnaire can be developed.

2.0.2 Definitions of Terms Used

It is common in the field of Human-Computer Interaction (HCI) to divide the

interaction between user and system into the overarching factors of usability and user

experience. Usability is often described using the ISO 9241-11 definition.

“The extent to which a product can be used by specified users to

achieve specified goals with effectiveness, efficiency and satisfaction in a

specified context of use.” (ISO 9241-11, 1998)

Such a definition provides us with the essence of usability – how well the

software enables (or hinders) users’ achievement of their goals. We might also wish

to add the factor of learnability to this definition,

as a system could be effective, efficient and satisfying yet very difficult to learn

(Abran et al 2003), which would inhibit system usability for novices.

Despite satisfaction being listed above as an element of usability it is rarely

considered as such. Indeed, for a long time it was rarely considered at all, with the

focus of research and evaluation purely being upon effectiveness and efficiency (e.g.

Nielsen and Molich, 1990; Polson et al, 1994). Gradually, the HCI community

realised that it needed to go ‘beyond usability’ and simple measures of satisfaction to

examine the broader user experience, including issues such as self-efficacy, aesthetics,

social factors and fun (Dillon, 2003). Applied to video games, this gives us the

concept of player experience – the experience of the user playing the game.

It would, however, be erroneous to fully cleave the affective experience of an

interaction from the usability of the software used. For example, it is now common

knowledge that aesthetic properties of a system can influence the perceived usefulness

of the system (Tractinsky et al, 2000). Moreover, both hedonic (i.e. experiential) and

ergonomic (i.e. usability) qualities have been found to influence a system’s appeal to

users (Hassenzahl et al, 2000). This entails that these two factors interact and that

both are important for systems. Whilst these factors vary in their importance for

different systems, it is likely that both influence the appeal of all software. This is the

case with video games; games may not involve goals in the manner of productivity

software but have the purpose of delivering a certain experience. Usability issues must

be corrected so that such an experience can be delivered.

The final factor is sui generis to video games (or rather, to the domain of

games in general): playability. The distinction between usability and playability is

usually seen as usability relating to interface and control issues and playability

relating to game mechanic issues (Korhonen and Koivisto, 2007; Febretti and

Garzotto, 2009). A game menu being difficult to navigate would be a usability issue;

ensuring that combat in a game has the correct pace would be a playability problem.

Playability thus regards how the game itself operates; its rules and its level of

challenge. Some playability problems, such as unfairly advantaged ‘cheating-AIs’

(Shelley, 15 August 2001) are clearly distinct from usability problems, yet others are

not – when players feel they are not in control of their character, is that a playability

issue relating to poor player empowerment or a usability issue relating to poor

controls? Does a poor in-game camera impair playability or usability?

In short, whilst some playability concepts are clearly distinct from usability

problems, many are not. Nevertheless, there is good reason to treat them as separate

constructs in at least some respects; playability problems are more fundamental to the

game design than usability problems, and these need to be prioritised, tested and

caught sooner (Korhonen and Koivisto, 2006). For the evaluator then, playability is

best treated as a domain-specific class of critical usability qualities. The next task is to

examine what experiential, usability and playability factors relate to video games.

2.1 Factors Involved in the Player Experience

2.1.1 Player Experience and Engagement

Studies of player experience have focused upon a number of different

constructs in an attempt to determine what makes video games so engaging.

Engagement is a term used to characterise a state of involvement with a piece of

software; a video game with an enjoyable player experience is thus said to be

engaging. Whilst the term has been given various meanings in the literature (e.g.

Lindley et al, 2008; Douglas and Hargadon, 2000), Lazzaro (2004) models player

engagement as resting upon the four ‘keys’ of Hard Fun (or challenge), Easy Fun

(involving immersion, curiosity and delight), Altered States (emotion, relaxation) and

The People Factor (social interaction). This model recognises that games do not need

to be challenging to be engaging – the game world itself can engage players

sufficiently. Nevertheless, whilst categorizing the aspects of engagement in such a

way is useful, it doesn’t examine the constituents of these factors in enough depth.

Why should we include narrative? Can we have easy fun without narrative? What

emotions are involved in ‘Altered States’? Indeed, this is the problem with the entire

notion of engagement; it simply restates player experience without unpacking it

enough. Nevertheless, if we hold Lazzaro’s concepts of Hard and Easy Fun to still be

useful (as these are the most universal to all games) then the first element to examine

in greater depth is challenge and the construct of flow.

2.1.2 Flow, Cognitive Absorption and Challenge

Mihaly Csikszentmihalyi (1975; 1990) found that the experience of

performing with a high level of skill at challenging tasks had a peculiar character that

he called flow. Individuals who engage in tasks in order to experience flow are

engaging in an autotelic (auto = self, telos = purpose) activity – their internal

motivation replaces any external motivation. Indeed, people were found to be at their

happiest when engaging in such an internally motivated task. Eight main elements of

such flow experiences were identified by Csikszentmihalyi (1990):

1. A challenging but completable task

2. Attention is focused wholly upon the task

3. The task has clear, unambiguous goals

4. The task provides immediate feedback for actions.

5. The individual feels fully in control

6. Immersion in the task that removes awareness of everyday life

7. Sense of self diminishes, but is reinforced afterwards

8. Awareness of the passage of time is reduced.

However, these factors must occur during a task that balances the individual’s

skills with the challenges that they face; too little challenge vs. skill and the user can

become bored; too much and they become anxious and lose their sense of control. In

the narrow band between boredom and anxiety lies the flow channel; activities that

elicit experiences in this band are so rewarding that individuals will go to great

lengths to engage in them for the sake of the experience (see Figure 2.1 below).

[Figure 2.1 plots Challenge against Skills, both running Low to High; the Flow Channel lies between the Anxiety (upper) and Boredom (lower) regions.]

Figure 2.1. Graph showing the relationship between
Challenge and Skills – the correct balance results in Flow
experience. Redrawn from Csikszentmihalyi (1990)

This was later built upon by Massimini and Carli (1988) who noted that the

individual’s mean experience was neither optimal nor negative; rather it was neutral.

Their experience fluctuation model (see Figure 2.2 below) better accounts for the variety

of human experience. For most activities with average challenges for which we

possess average skills we do not experience flow; it is only when our skills and the

corresponding challenge are high that flow is experienced. Indeed, the key to

understanding flow is to recognise that it is an optimal experience, and certainly not a

mundane one. To continue experiencing flow becomes a central goal for the

individual, whatever the source of the optimal experience – including video games.

[Figure 2.2 plots challenge against skills, both running Low to High, with the region around the mean divided into the channels Arousal, Flow, Control, Relaxation, Boredom, Apathy, Worry and Anxiety.]

Figure 2.2. Graph showing how differing balances of challenge
and skills result in different affective experiences.
Adapted from Massimini and Carli (1988)
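To make the structure of this eight-channel model concrete, the sketch below maps a player's reported challenge and skill (standardised against their personal mean) to a channel. This is purely an illustration of the model as it is commonly presented, not code from Massimini and Carli or from this report; the 0.5 thresholds are arbitrary.

```python
def classify_experience(challenge: float, skill: float) -> str:
    """Map standardised challenge/skill ratings (relative to the player's own mean,
    so 0 == average) to a channel of the experience fluctuation model.
    The 0.5 thresholds are illustrative, not taken from Massimini and Carli."""
    def level(x: float) -> str:
        if x > 0.5:
            return "high"
        if x < -0.5:
            return "low"
        return "mean"

    channels = {
        ("high", "high"): "Flow",
        ("high", "mean"): "Arousal",
        ("high", "low"): "Anxiety",
        ("mean", "low"): "Worry",
        ("low", "low"): "Apathy",
        ("low", "mean"): "Boredom",
        ("low", "high"): "Relaxation",
        ("mean", "high"): "Control",
    }
    # Average challenge met with average skill falls outside the eight channels:
    # the neutral, mundane experience described in the text.
    return channels.get((level(challenge), level(skill)), "Neutral")


print(classify_experience(1.2, 1.1))   # Flow: high challenge matched by high skill
print(classify_experience(-1.0, 0.1))  # Boredom: low challenge, average skill
```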

Chen (2007) held flow to be the sine qua non of an enjoyable game

experience. Chen suggests that games must adapt to the different flow zones (i.e.

difficulty level-tolerances) to ensure that as many users experience flow as possible.

Another model that suggests ways to maximise player flow is the GameFlow model

of Sweetser and Wyeth (2005). This takes each element of flow (as listed above) and

identifies a game feature that must be present and/or optimised for flow to occur. These

are concentration, challenge, player skills, control, clear goals, feedback, immersion

and social interaction. However, the faults of this, and similar models, are two-fold.

First, flow is an optimal experience that gamers will only experience on occasions and

perhaps only fleetingly (as noted by Jennett et al, 2008) whilst flow-like experiences

are by no means a necessary component of an enjoyable gaming experience (Cowley

et al, 2008). Second, many of the descriptors used to signify flow (such as ‘temporal

dissociation’, ‘concentration’ and ‘control’) can be explained by other constructs.

One such construct is Agarwal and Karahanna’s (2000) concept of cognitive

absorption, that describes a state of deep involvement with a piece of software

through the factors of temporal dissociation, focused immersion or total engagement,

heightened enjoyment, control and curiosity. The precedents of this deep involvement

were stated as the perceived usefulness and perceived ease of use of the system (the

‘usefulness’ of games could be considered to be their enjoyability) – clearly less

specific requirements than for an optimal flow experience. The key point is that

cognitive absorption invokes many similar experiences to flow (though presumably at

a lesser intensity) without it being an optimal experience. Whilst gamers may

occasionally experience flow, their mundane experience of involvement with a game

is perhaps best explained by cognitive absorption. It is thus likely that when many

investigators thought that they were examining flow they were in fact measuring

milder, less optimal forms of experience.

However, we shouldn’t as a result ignore flow; flow is still a measure of a

successful game, but just as diamonds can measure one’s wealth it is an exceptional

measure and not the norm. When we come to measure cognitive absorption, very high

reported levels of absorption (and one other factor – as below) should be taken to

represent flow.

Challenge is the other factor that, when appropriate levels of it are reported, is

likely to indicate flow. All games should provide an adequate challenge yet not be too

difficult; this is both a basic factor underlying flow and a common heuristic suggested

for game design (e.g. Federoff, 2002; Desurvire, Caplan and Toth, 2004). By

separating our measure of flow into measures of challenge and of cognitive absorption

we can measure both suboptimal and optimal experience. Challenge and (the

factors that underlie) cognitive absorption will describe flow if optimal; otherwise

they will describe qualities of the average gaming experience.

2.1.3 Presence, Immersion and Fun

Having considered Lazzaro’s (2004) ‘Hard Fun’ element of gaming fun, we

must now consider its little brother, “Easy Fun”. The aspect of altered states will be

considered (under the banner of ‘fun’), though ‘the People Factor’ will not be (though

important, it is limited to multiplayer games).

The phenomenology of interacting with a game, especially if it involves

avatars, is very idiosyncratic, and a number of constructs have been used to explain it.

One such construct is the notion of presence. This concept arose from Virtual Reality

(VR) research, where a peculiar feeling as though one is in the Virtual Environment

(VE) was noticed by researchers. Floridi (2006) defines presence as:

“a type of experience of “being there”, one loosely involving

some technological mediation and often depending on virtual

environments” (Floridi, 2006)

Both Pinchbeck (2005) and Takatalo (2006) find that presence is a relevant

concept for video game experience. This seems unlikely – not only was spatial

presence (the essence of presence) the weakest extracted factor for Takatalo (2005)

(behind such factors as role engagement and attention) but gamers do not speak of

spatial presence in regards to their experiences (Jennett et al, 2009). Rather, their

‘being there’ is in a narrative, causal and social sense; it is categorically not a spatial

presence.

If we discount the notion of presence, how then do we account for the sense of

being “in” a game world? Brown and Cairns (2004) used grounded theory to examine

what gamers meant when they spoke of being immersed in a game. They found three

levels of immersion:

1.) Engagement – the player must invest time, effort and attention to overcome barriers

to the game – such as learning the controls or comprehending the setting.

2.) Engrossment – the game dominates player attention and players become

emotionally invested, provided that the game mechanics and plot are well constructed.

3.) Total Immersion – players experience presence, empathy with characters and are

totally absorbed with the game.

The first thing to note is that, as per Jennett et al (2009) it is unlikely that

players experience presence when totally immersed, whilst immersion should be

considered as a spectrum and not as a set of discrete stages. Secondly, this model of

immersion is somewhat simplistic, and treats immersion as a single phenomenon that

encompasses both challenge and diegesis. It is thus just another label for ‘gameplay

experience’ if treated in this way. Even so, it still tells us something

interesting; both that the gameplay experience is a spectrum and that barriers must be

overcome to progress through this spectrum – including player experience, usability

and playability problems.

Brown and Cairns (2004) failed to fully consider the narrative aspects of

immersion, yet many have suggested that immersion is a narrative phenomenon, and

that immersion involves deep engagement with a plot or setting (McMahan, 2003;

Douglas and Hargadon 2000). Ermi and Mäyrä (2005) account for this understanding

of immersion in their SCI model of immersion, which identified three types of

immersion:

Sensory immersion – the player becomes immersed in the sensory information –

visual, auditory and tactile - that a game provides. Sega’s Rez

(http://www.thatgamecalledrez.com/ - see Figure 2.3 below) is probably the purest

example of this.

Challenge immersion – immersion resulting from a balance of challenges and skills,

requiring motor skills and/or strategy.

Imaginative immersion – immersion in the fantasy of the game, the plot, the game

world and identification with the characters.

Figure 2.3. Sega’s Rez, a classic example of how visual, auditory and haptic
(a “trance vibrator” peripheral was released) interactions combine to induce
sensory immersion in players. From www.ign.com.

A game can support all three of these types of immersion, whilst the elements

of the SCI model are fully compatible with Brown and Cairns’s (2004) levels of

immersion, giving us a more refined model of immersion. Two amendments are

suggested. First, though challenge is an important element of any model of gameplay

experience, calling it ‘immersion’ recalls the habit of redescribing gameplay

experience in terms of one construct – be it immersion, flow or presence. It would be

better to only consider sensory and narrative immersion in our questionnaire and just

leave challenge as it is – as challenge. In addition, as per Arsenault (2005),

imaginative immersion is best named fictional immersion to better capture its

character.

Calvillo Gamez, Cairns and Cox (in preparation) created a grounded theory

from scores of press game reviews and articles that centred on the notion of puppetry.

This involves factors of control (how to manipulate the game), ownership (the player

comes to set personal goals and is provided with rewards) and facilitators (such as

aesthetics, which allow for control and ownership). Ownership is the most interesting

factor here, as it provides an explanation of persistence and replayability - why do

players come back to a game? The model suggests that it is their own goals and sense of

reward that keep players going; they own the game in choosing which challenges to

take. Some of these may be easy (and the player can thus feel rewards from showing

their mastery) and some difficult. The idea of ownership must also be measured.

The final piece of the puzzle requires us to consider fun. If we consider fun in

terms of dimensional affect, it involves high arousal and high positive valence. Not all

games induce a sense of fun directly (e.g. survival horror games aim to scare; a well

written roleplaying game may induce grief) but the net affective experience of playing

a game should involve fun (Zagalo et al, 2005). In his classic research on fun, Malone

(1981) found the key factors of fun are challenge, fantasy (akin to narrative) and

curiosity. This provides us with two more factors – affect valence (which should be

positive overall) and variety, as a lack of variety greatly inhibits curiosity and thus

fun.

2.2 Usability and Playability in Games

As the earlier discussion of player experience noted, reaching deeper and more

enjoyable levels of experience requires overcoming barriers; if these barriers are too

great, the experience will be diminished. Some of these barriers to play relate to the

player experience (such as the difficulty level). Others relate to the usability and

playability aspects of the game (such as poor controls). This section will determine

how questionnaires are an important tool for removing usability and playability

barriers as well as player experience issues, and that there is a need for a new

questionnaire with which to do this.

Gilleade and Dix (2004) distinguished between at-game frustration (resulting

from poor controls and interfaces) and in-game frustration (resulting from unclear

goals, navigation and similar). At-game frustration is always detrimental to the

gameplay experience; in-game frustration (as IJsselsteijn et al (2007) note) is not

always harmful. In-game frustration does not necessarily come from detrimental

factors such as unclear goals, but also arises from the challenge of the game. If we

remove all in-game frustration, the player has nothing to overcome and thus cannot

experience fiero, or the experience of personal triumph over adversity (Lazzaro, 2004)

– a critical emotion for gamers.

Standard usability evaluation techniques seek to remove all sources of

frustration; we therefore need to tailor our evaluation methods to the video game

domain. We should triangulate on usability (and playability) problems using a number

of methods (Gray and Salzman, 1998). Kim et al (2008) describe the TRUE (Tracking

Real-time User Experience) methodology. This involves recording user-initiated

events; sets of data that describe what the user was doing when an event was initiated.

So if a player crashes in a racing game, their speed, the track, the conditions, their

location, etc are recorded. This is combined with observational (via video) and

attitudinal (via questionnaire and interview) data to determine what a player was

doing throughout a level or track that would lead them to enjoy or dislike it.
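As an illustration of the kind of user-initiated event record that the TRUE approach relies on, the sketch below logs one event together with a snapshot of gameplay state. The field names and JSON-lines format are assumptions made for this example; Kim et al (2008) do not prescribe this particular schema.

```python
from dataclasses import dataclass, field, asdict
from time import time
import json

@dataclass
class GameEvent:
    """One user-initiated event in the spirit of TRUE: an event name plus whatever
    state describes what the player was doing when it fired. Field names here are
    illustrative only."""
    name: str                      # e.g. "crash", "death", "level_complete"
    session_id: str
    timestamp: float = field(default_factory=time)
    attributes: dict = field(default_factory=dict)

def log_event(stream, event: GameEvent) -> None:
    # One JSON line per event, so the set can later be filtered and cross-referenced
    # with the video and questionnaire data.
    stream.write(json.dumps(asdict(event)) + "\n")

# Example: a crash in a racing game, recording speed, track, conditions and location.
with open("events.jsonl", "a") as f:
    log_event(f, GameEvent(
        name="crash",
        session_id="participant-07",
        attributes={"speed_kmh": 142, "track": "coastal_loop",
                    "weather": "rain", "position": [1033.2, 87.5]},
    ))
```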

Questionnaires are thus an important part of this process, but they should not

be understood as uncovering problems; interviews and observations are more

effective for this. First, questionnaires can suggest to the evaluator where they must

look in a huge dataset to uncover problems. If players found the difficulty too hard,

this would suggest that they keep dying or losing a race, and the problem could be

uncovered by examining the relevant part of the dataset. Second, such questionnaires

can act as a rubberstamp and quantify the severity of problems found in the other data

sets.

Which questionnaire should be used to do this? An extensive sweep of the literature

suggests that no validated usability and/or playability scale exists, compared to the

numerous scales available for productivity software usability - e.g. Chin et al (1988).

However, there are a number of studies that generated heuristics for evaluating games.

Such heuristics could generate areas of interest or constructs that should be examined

by any future questionnaire.

A number of existing sets of usability heuristics were examined (namely,

Febretti and Garzotto, 2009; Desurvire and Wiberg, 2009; Pinelle, Wong and Stach,

2008; Federoff, 2002 and Korhonen and Koivisto, 2006). Desurvire and Wiberg

(2008) was excluded from this analysis as their heuristics focused on game

approachability (and thus focused on casual gamers – for whom a different set of

factors are appropriate and are not the focus of this analysis) whilst Korhonen and

Koivisto (2007) was excluded due to the focus on multiplayer games. The analysis is

summarised in Table 2.1 below.

Table 2.1. Gameplay Heuristics Found Across the Literature

Heuristic | Previous Studies that Included It | Include in Current Study? | Why Exclude from Study?
Control(s) | All | Yes | N/A
Goals | All | Yes | N/A
Interface | All | Yes | N/A
Consistency | All but Febretti et al (2009) | Yes | N/A
Help | All but Febretti et al (2009) | Yes | N/A
Customisation | All but Korhonen et al (2006) | Yes | N/A
Variety | All but Pinelle et al (2008), Febretti et al (2009) | Yes | N/A
Navigation | Federoff et al (2002), Desurvire et al (2008) | Yes | N/A
Views | Pinelle et al (2008), Febretti et al (2009) | Yes | N/A
Challenge | All but Pinelle et al (2008) | No | Experience Factor
Immersion | All but Pinelle et al (2008) | No | Experience Factor
Feedback | All but Pinelle et al (2008), Febretti et al (2009) | No | Covered by other Heuristic
Error Recovery | All but Pinelle et al (2008), Febretti et al (2009) | No | Covered by other Heuristic
Rewards | All but Pinelle et al (2008), Febretti et al (2009) | No | Experience Factor
Terminology | All but Pinelle et al (2008), Febretti et al (2009) | No | Covered by other Heuristic
AI | All but Korhonen et al (2006) | No | Genre-specific

In Table 2.1 above, ‘Control(s)’ refers to the quality of the game’s controls

and the player’s feeling of control; ‘Goals’ to the need for clear player objectives and

‘Customisation’ to the need for customisable controls and settings. ‘Consistency’

means the consistency of input to output mappings; ‘Views’ to the quality of the in-

game perspective; ‘Interface’ to the game’s menus and (in-game) Heads-Up Display

(HUD) and ‘Help’ to the need to provide help to the player. Finally, ‘Navigation’

entails that the player should not get lost in the game world (i.e. it has a slightly

different meaning to the concept of navigation in productivity software) and ‘Variety’

entails that the player should enjoy a range of gameplay elements.

The ‘Challenge’, ‘Immersion’ and ‘Rewards’ heuristics are already considered

by experience and challenge items on the scale (reward being an element of

challenge). ‘Feedback’, ‘Error Recovery’ and ‘Terminology’ are covered by other

heuristics (i.e. ‘Goals’ and ‘Consistency’ cover feedback; ‘Interface’ largely exhausts

terminology). Not all games have ‘Artificial Intelligence’ (AI) as not all have

computer-controlled opponents (e.g. multiplayer games), so this heuristic was

excluded.

Overall, the heuristics have provided a foundation upon which a scale can be

constructed. As was argued, both usability and playability should be evaluated as both

are needed to improve a game. Given the distinction between usability and playability

that was defined earlier, the above heuristics can be divided into playability and

usability factors (e.g. ‘Goals’ involves a playability issue; ‘Interface’ is a usability

issue, etc). Additionally, as no usability and playability scale exists creating a new one

would clearly facilitate player testing. Both player experience and usability/playability

factors to be included in any games evaluation have now been considered; the next

task is to discuss the construction of the questionnaire.

2.3 Measuring the Video Game Experience

2.3.1 Evaluating Selected Constructs

Given the above review, the following constructs in Table 2.2 (below) were

identified as needing to be evaluated for video games.

Table 2.2. Four Main Factors of Video Game Experience and Sub-Constructs

Factors Mediating Video Game Experience

Experience: Fictional Immersion, Sensory Immersion, Affective Valence
Challenge: Challenge, Absorption, Ownership
Playability: Variety, Clear Goals, Navigation, Help/Training
Usability: Control, Customisability, Consistency, Camera (Views), Game Interface

Table 2.2 shows that Experience, Challenge, Playability and Usability all need

to be considered. By using a questionnaire involving closed questions we can quantify

the degree to which users had a problem with, or enjoyed, an element of the game,

just as we can quantify the nature of their overall experience. If well designed, such a

scale would correlate with (and thus help us to predict) important measures of a

game’s success: review scores, sales, the game’s appeal etc.

In the domain of productivity software, a questionnaire designed to do just that

exists: Hassenzahl et al’s (2000) Attrakdiff questionnaire. After examining one of

seven prototypes that varied in terms of their ergonomic quality (i.e. usability) and

hedonic quality (i.e. user experience), Hassenzahl et al’s (ibid.) participants filled in

scales that measured these two factors and the product’s appeal. Both of these factors

were found to correlate with the software’s appeal.

There is little reason to suppose that this isn’t the case for video games; the

prior review has shown that both hedonic and ergonomic factors likely contribute to a

game’s quality and appeal. However, no such questionnaire currently exists for video

games. If we are to aid player testing by creating such a questionnaire, we must first

decide what type of questionnaire to design.

A commonly used type of questionnaire is the Likert scale (Likert, 1932, cited

in Carifio and Perla, 2007). The Likert scale has a very particular process: a large

number (80-100) of statements is generated that relate to a particular concept. Beneath

these is a response item where respondents mark how much they agree with the

statement, usually rated 1-5 or 1-7 (see Figure 2.4). These are then rated by a number

of judges in terms of how well they relate to the given concept. Inter-correlations

between these items are then calculated, and the best 10-15 in terms of rating and

inter-correlation are kept as the scale. By then summating the scores from each item to

give a total scale score, we have a scale that measures the respondent’s attitude

towards the concept.
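A minimal sketch of the scoring and item-selection steps just described: each respondent's item scores are summed into a scale score, each item is correlated with the total of the remaining items, and the best 10-15 items are retained. The data are randomly generated and the cut-off of 12 items is arbitrary; this illustrates the general procedure, not the analysis performed in this report.

```python
import numpy as np

# One row per judge/respondent, one column per candidate statement; each cell is a
# 1-7 agreement rating. The data here are random, purely for illustration.
rng = np.random.default_rng(42)
ratings = rng.integers(1, 8, size=(100, 80))

# Summated scale score: the sum of a respondent's item scores.
scale_scores = ratings.sum(axis=1)

# Corrected item-total correlation: each item against the total of the *other* items,
# so that an item is not correlated with itself.
item_total = np.array([
    np.corrcoef(ratings[:, j], scale_scores - ratings[:, j])[0, 1]
    for j in range(ratings.shape[1])
])

# Keep the best 10-15 items (here the top 12 by item-total correlation).
retained = np.argsort(item_total)[::-1][:12]
print("Retained item indices:", sorted(retained.tolist()))
```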

There are two important things to note here: first, that this method is rarely

followed. Instead, the scale is usually administered to a large number of respondents

(100+) which allows factor analysis to be performed (Oppenheim, 1992). This allows

us to determine which sub-scales contribute to the scale and ensure the

reliability and validity of the scale far better than with Likert’s original technique.

Reliability and validity are the key metrics of scale success, with reliability referring

to the degree to which the measurement is free of errors and validity referring to the

usefulness and meaningfulness of a measure (Jensen, 2003).
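A sketch of the factor-analytic route described above is given below: responses from 100+ participants are factor analysed to recover sub-scales, and each sub-scale's reliability is then summarised with Cronbach's alpha (a common internal-consistency index that the text does not itself name). The four-factor solution and random data are illustrative assumptions; Study 1 of this report in fact uses cluster analysis rather than factor analysis.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# responses: 100+ respondents x candidate items, 1-7 ratings (random for illustration).
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(120, 49)).astype(float)

# Exploratory factor analysis: which latent factors (sub-scales) underlie the items?
fa = FactorAnalysis(n_components=4, random_state=0).fit(responses)
loadings = fa.components_.T            # one row per item, one column per factor
subscale_of = loadings.argmax(axis=1)  # assign each item to its highest-loading factor

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a set of items (respondents x items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

for f in range(4):
    cols = np.where(subscale_of == f)[0]
    if len(cols) > 1:
        print(f"Factor {f}: items {cols.tolist()}, alpha = "
              f"{cronbach_alpha(responses[:, cols]):.2f}")
```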

Second, the scale is not (as is all too commonly believed) the response item

beneath a statement, such as in Figure 2.4 below.

Figure 2.4. Example Likert-type response item.

Figure 2.4 may have scalar properties but is not a scale. As Carifio (2008)

contends, one should never call or treat single response items like a summated scale;

only a summated scale can be considered as measuring an attitude. The whole

advantage of scaling is that the summated score increases reliability and validity when

examining attitudes; testing single items massively increases the familywise error

rate.

Using Likert scaling would allow us to quickly generate, pilot and validate a

new questionnaire that can examine many constructs underlying the experiential and

usability properties of a video game. It would do so in terms of participants’ attitudes

to the game that they just played. How to design such a scale is therefore explored

next.

2.3.2 Scale Design and Validation

As per Hornbaek’s (2006) call to improve the practice of usability

measurement, the scale’s content should be based upon research into scale design.

Survey methodologies have progressed sufficiently over the past century or so for

Schaeffer and Presser (2003) to confidently declare that there is no longer an art but

rather a science of asking questions. Whilst the strength of this assertion is perhaps

debateable, it is certainly true that a good deal of research has refined survey and scale

design methods. The following section outlines what could be considered ‘best

practice’ in scale design, providing a number of criteria that any new scale must meet.

Attitude judgements measured by scales reflect the information that was

available at the time – this means that the context at the time the question is asked

causes bias, and the most important such context is the scale construction

(Tourangeau, 1999). The highest level sources of such error are order effects; these

occur when responses to later questions are influenced by the content of earlier

questions.

Order Effects

Effects regarding how questions appear to fit into a higher level category to a

respondent are known as assimilation effects (Tourangeau, 1999). For simple Likert

scales this is unavoidable – these effects act as a demand characteristic, and if

respondents grasp the overall purpose of a scale it can bias their responses. However,

if the scale is comprised of subscales the best solution is to separate out questions

from each subscale. Asking questions in close proximity increases the likelihood of

respondents altering their attitudes accordingly to increase consistency among

responses (McGuire, 1960). This may increase correlations between items, but this

greater correlation is illusory and a source of error. Mixing the order of subscales

throughout a scale can reduce this bias, if not eliminate it.

Another major order effect is the part-whole effect (Krosnick, 1999; Lietz,

2008; Martin, 2006), whereby more general questions asked after specific questions

can be misinterpreted – by respondents excluding the content of the specific question

from the general one, for example. The cure for this is simple: ensure that more

general questions are always asked before specific ones.

Question Wording

Moving now to the content of the questions themselves, all sources advise to

keep questions as short as possible (Lietz, 2008; Foddy, 1993; Dillman, Tortora and

Bowker, 1998) with the rule of thumb being a limit of around 20 words per question

(Oppenheim, 1992). Overall scale length should also be minimised, especially when

using web surveys (Ganassali, 2008). The wording of questions should be

kept as simple and unambiguous as possible, avoiding leading questions, ambiguity,

double-barrelled questions (containing multiple clauses) or double negatives (Lietz,

2008; Martin, 2006; Foddy, 1993; Krosnick, 1999; Alwin and Krosnick, 1991).

However, a fundamental element of Likert scaling involves including positive and

negative statements about a potential viewpoint (Likert, 1932, cited in Carifio and

Perla, 2007) - which can lead to double negatives. Whilst one should aim to use

wording that will not lead to this – ‘ugly’ as opposed to ‘not attractive’ – this may be

unavoidable. Nevertheless, negative statements also have the added benefit of helping

to reduce the phenomenon of acquiescence – whereby many participants will simply

agree with any statement provided (Hinz et al, 2007). Since they must express their

level of agreement for both positive and negative positions, introducing negative

statements should reduce the strength of this effect (Cox and Cairns, 2008), whilst

there is evidence that both positively and negatively worded items do test the same

construct (Bergstrom and Lunz, 1998), allaying any fears that they may not.
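Where negatively worded statements are included, they are reverse-scored before the item scores are summed, so that a higher total always indicates a more favourable attitude. A minimal sketch on a 7-point response item, with hypothetical item keys:

```python
# Reverse-score negatively worded items on a 1-7 response item before summing.
# The item keys and the set of negative items are hypothetical examples.
NEGATIVE_ITEMS = {"controls_difficult", "game_ugly"}
SCALE_MAX = 7

def scored(item: str, raw: int) -> int:
    # On a 1-7 item, reversal maps 1->7, 2->6, ..., 7->1.
    return (SCALE_MAX + 1 - raw) if item in NEGATIVE_ITEMS else raw

answers = {"game_fun": 6, "controls_difficult": 2, "game_ugly": 1, "clear_goals": 5}
total = sum(scored(item, raw) for item, raw in answers.items())
print(total)  # 6 + 6 + 7 + 5 = 24
```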

Response Item Design

In terms of the Likert-type scalar response item accompanying each question

on the scale, there are a number of suggestions. Whilst some have suggested that

response items with only three response options are adequate (Jacoby and Matell,

1971) the general consensus is that larger items of 5-7 options are required (Lietz

2008; Krosnick 1999; Preston and Colman 2000; Cox 1980; Lehmann and Hulbert

1972; Colman, Norris and Preston 1997) to ensure an adequate level of reliability and

validity whilst reducing cognitive load on participants. Some suggest that larger items

(of 11 points or more) are desirable (e.g. Dawes, 2001), yet other research has found

indices for reliability and validity improve up to 7 response options (Masters, 1974)

and decrease beyond 10 options (Preston and Colman, 2000). Indeed, Cox (1980) called it

the ‘lucky number 7, plus or minus two’, in reference to Miller’s (1956) dictum on

number span (though not suggesting that the two are linked). Seven response options

are thus recommended.

All of the quantities suggested above are odd, which entails that a midpoint for

the scale is endorsed. Inclusion or exclusion of this will change the data gathered

(Garland, 1991), and whilst some have found that midpoints do not affect scale

reliability (Alwin and Krosnick, 1991), the general consensus is that they do improve

this measure (Lietz 2008; Oppenheim 1992). Good question clarity, meanwhile,

reduces respondents adopting a satisficing strategy (reducing cognitive load by

selecting the first acceptable response) and selecting the midpoint without further

thought (Velez and Ashworth, 2007). The wording for the response option labels

should be balanced (i.e. ‘like vs. dislike’, not ‘like vs. hate’) and the response

item should be unipolar (i.e. run from ‘1-7’ not ‘-3 to 3’) (Lietz, 2008).

In terms of response item layout, the best layout is to run left to right,

ascending numerically and from negative to positive responses (as in Figure 2.5

below). The opposite has been found to distort results (Hartley and Betts, 2009) and

running left to right better matches the reading direction of Latin text.

Figure 2.5. Response item runs from left to right, ascending


numerically from negative to positive responses.

‘Don’t know’ options are to be avoided, as many participants that fill them in

do have genuine attitudes to provide, and only choose “don’t know” options due to a

satisficing strategy (Gilljam and Granberg, 1993). Labelling every response option

improves the quality of the data acquired (Krosnick, 1999), whilst ‘strongly’ and

‘slightly’ are good extreme point and mild qualifiers respectively (Lietz, 2008).
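The recommendations above can be captured in a single response-item definition: seven options, every point labelled, unipolar 1-7 coding, running from negative on the left to positive on the right, a labelled midpoint, balanced endpoints and no 'don't know' option. The label wording below is illustrative rather than the exact wording used in the Gameplay Scale.

```python
# A 7-point Likert-type response item following the recommendations reviewed above.
RESPONSE_ITEM = [
    (1, "Strongly disagree"),
    (2, "Disagree"),
    (3, "Slightly disagree"),
    (4, "Neither agree nor disagree"),   # endorsed midpoint
    (5, "Slightly agree"),
    (6, "Agree"),
    (7, "Strongly agree"),
]

def render(statement: str) -> str:
    options = "   ".join(f"[{code}] {label}" for code, label in RESPONSE_ITEM)
    return f"{statement}\n{options}"

print(render("I thought that the game was fun"))
```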

Web Surveys

As for web surveys (important, as the scale being designed may begin life as

one) there is little evidence that paging or scrolling between web survey questions has

an impact on response rate (Peytchev et al, 2006), although we may wish to follow

good usability practice here and avoid scrolling. Finally, any demographics questions

should come at the end of a survey, once respondents are committed to completing the

scale (Lietz 2008; Oppenheim 1992).

2.3.3 Previous Gameplay Scales

Having reviewed good scale design practice, we must now consider the

designs of existing scales. In terms of game experience, a number of scales are in use.

The earliest was probably Witmer and Singer’s (1998) Presence Scale, originally

designed for VR research and the experience of being in a VE. Whilst this is

commonly used in VR studies and has been used in gaming studies (e.g. Eastin and

Griffiths, 2006) it has several flaws. First, some of the questions are overly complex,

making use of the dreaded conjunction ‘or’. For instance, “How much did the control

devices interfere with the performance of assigned tasks or with other activities?” and

“How responsive was the environment to actions that you initiated (or performed)?”

have a great deal of complexity added by the conjunction. Second, some have argued

(such as Slater, 2004) that presence should not be measured solely using scales; an

abstract concept such as presence needs more sources of evidence before we can label

an inter-correlation between subscales ‘presence’.

In his thesis, Kim (2006) makes use of a refined version of Chen’s (2004)

Game Engagement Questionnaire (GEQ), the GEQ-R. This scale is rooted in the Presence

Scale, but modifies it for video games. This scale fails to break the concept of

engagement down; whilst there are questions measuring aspects such as control,

graphics etc, no subscales are formed. Since we should avoid making judgements

based upon single response items, this scale allows us to quantify how engaging a

game was – and little else, something that is not useful when attempting to measure

the qualities of a game in more detail.

Ermi and Mäyrä (2005) created a novel scale to measure their SCI model of

immersion. Leaving aside “challenge immersion” (which, as aforementioned, is not

being viewed as a form of immersion), one of the sensory immersion items is

somewhat limited. “The game looked credible and real” would not apply to many

non-realistic games that nevertheless have an immersive sensory experience (Sega’s

Rez again springs to mind). As for narrative immersion, some of the items seem poor

due to the translation into English from the original Finnish (i.e. “I handled also my

own emotions through the game”) which is to be forgiven; the constant reference to

‘characters’ is less forgivable, as it limits the scope of the scale to those games that

involve avatars (e.g. not Real Time Strategy (RTS) where there is no representation of

the player, who instead controls whole armies).

Calvillo-Gamez (2009) developed the Core Elements of the Gaming

Experience Questionnaire (CEGEQ), whilst Jennett et al (2008) designed the Immersion Scale.

Whilst these are not perfect, they are a considerable improvement on what came

before. The former measures enjoyment and frustration in addition to factors relating

to Calvillo-Gamez’s puppetry model – control, ownership, facilitators; the shift in

focus from presence and flow is welcome, whilst the balance of positive and negative

statements should reduce acquiescence. Nevertheless, there is still no focus on

usability or playability factors as well as a lack of focus on narrative elements.

Finally, IJsselsteijn et al’s (in preparation) Game Experience Questionnaire (GEQ) has

been used in a number of studies (e.g. Nacke and Lindley 2008; Lindley and Nacke,

2008) with some success. This scale divides flow and challenge (the former is

probably better considered as some form of absorption) and includes subscales that

don’t necessarily aid evaluation (such as ‘competence’ – do high reported levels just

suggest that a game is too easy?) or are genre-specific (‘tension’ is irrelevant for many

games – e.g. life simulation games such as Animal Crossing). Despite this, the GEQ is

well validated and has excellent question wording (using short, simple statements)

and a useful, reduced form for administering between levels/missions etc.

In the productivity software literature, a number of scales measure system

usability. The Attrakdiff (Hassenzahl, 2000) has already been discussed, but the most

widespread is the System Usability Scale (SUS) developed by Brooke (1996). Though

extremely short (10 items!) it has been found to be a remarkably robust measure of

system usability (Bangor, Khortum and Miller, 2008), effective on a multitude of

systems. However, it measures usability on a single scale; we might wish for a more

fine-grained analysis. Lewis (1995) designed the Computer System Usability Scale

(CSUQ) which does comprise of subscales – system usefulness, information quality

and interface quality. Tullis and Stetson’s (2004) review of usability scales found

these two to be the most robust, with a sample of 12 users providing the correct

findings (i.e. the same score as a larger sample) 90-100% of the time, and a sample of

10 users 75-80% of the time. In all, this suggests that such scales can be short,

measure multiple factors yet still be reasonably robust when testing small samples.

Clearly, selection of the correct constructs is paramount, and the preceding

review has shown what aspects should be measured. None of the existing battery of

scales provides a truly integrated measure of a game’s appeal or quality, although they

should influence the content of any new scale. If a scale is to accurately assess a

game, it must measure all of the pertinent factors. To address this need, the task now

is to create a new validated scale, before we can determine if such a scale can reliably

measure experience, challenge, playability and usability factors and predict video

game review scores.

3 Study #1: Scale Construction, Validation and Refinement

3.1 Rationale

The literature review has shown that experience, challenge, playability and

usability (broken down into the elements in Table 3.1 below) all need to be measured

if we are to assess a video game. A scale that measures these factors now needs to be

created, following the good design practice noted in section 2.3.2. As was discussed

in section 2.3.1, the first step is to create a large pool of questions, test them on a large

population and determine what factors emerge.

The questions that this first study seeks to answer are whether a scale

developed along these principles can measure the appeal and quality of a video game

reliably and accurately, and whether or not the scale has good validity. To determine

this, participants will also complete a modified version of Hassenzahl et al’s (2000)

Appeal Scale, it being hypothesised that if the two scales correlate then the

‘Gameplay Scale’ is indeed measuring factors that influence a game’s appeal – and

therefore has good validity.

3.2 Questionnaire Construction

3.2.1 The Gameplay Scale

The main scale being devised, named the Gameplay Scale, aimed to measure

15 elements of video game experience divided into 4 factors as in Table 3.1 below.

Table 3.1. Four Main Factors of Video Game Experience and Sub-Constructs

Factors Mediating Video Game Experience:

  Experience:   Fictional Immersion, Sensory Immersion, Affective Valence
  Challenge:    Challenge, Absorption, Ownership
  Playability:  Variety, Clear Goals, Navigation, Help/Training
  Usability:    Control, Customisability, Consistency, Camera (Views), Game Interface

(This is the same table as Table 2.2, redrawn here for clarity)

As Table 3.1 illustrates, on the basis of the literature review the factors of

experience, challenge, playability and usability are expected to emerge; there should

thus be four subscales to the main scale after analysis. The player experience question

content was drawn from existing questionnaires (notably, those of IJsselsteijn et al, in

preparation; Ermi and Mäyrä, 2005; Calvillo-Gamez, 2009; and Jennett et al, 2008),

modifying these questions as per the good questionnaire design review in section

2.3.2 and the review of how to measure player experience in section 2.1. This

included questions such as “I thought that the game was fun” and “I felt the game was

hard”. The usability and playability questions were drawn from the gameplay

heuristics reviewed in section 2.2 (i.e. those of Febretti and Garzotto, 2009; Desurvire

and Wiberg, 2009; Pinelle, Wong and Stach, 2008; Federoff, 2002 and Korhonen and

Koivisto, 2006), and were divided into playability and usability factors as in Table

3.1. These included questions such as “I found the controls to be difficult” and “The

game provided me with an adequate tutorial”.

This initial version of the scale had 49 questions (see Appendix A), 3-4 per

element - meaning that any of the elements could still be measured if subsequent

analysis showed that it formed a factor independent of anything else. These questions

were refined via pilot testing and each question had a 7-point Likert response item

with labels on every point. The aim was for the following analysis to remove items

with poor inter-correlations or reliability and thus leave a smaller, more accurate

scale.

Initial pilot testing involved a cognitive interview, as described by Fowler

(2002). This involves participants "thinking-aloud" as they work through the scale and

was performed (after a short gaming session) by 4 participants, resulting in substantial

amendments to the scale items and ordering.

3.2.2 The Appeal Scale

The Appeal Scale was a modified version of Hassenzahl et al’s (2000) appeal

scale, using 8 semantic differential items with a 7-point scale (see Appendix D).

Following the pilot testing (as explained above), the original "sympathetic-

unsympathetic" pair was deemed unsuitable for testing videogames and was replaced

with a “fun-boring” differential. This was viewed as a separate scale to the Gameplay

Scale, as it used a different form of response item and was being used to assess the

construct validity of the Gameplay Scale.

3.3 Methods

3.3.1 Participants

In all, there were 132 respondents to an online version of the scale, which was

accessed via a link to the survey posted (along with a substantive

description of the survey) on both a private PlayStation Beta Testers forum and on the

public official (EU) English PlayStation Forums

(http://community.eu.playstation.com/playstationeu/?category.id=55). This self-

selecting sample was reduced to 98 respondents (M: 25 years old, SD: 6.96; 7%

female) once partial responses had been filtered out. The gender bias is noted as a

limitation for generalising the findings.

3.3.2 Materials

The study used an online version of the questionnaire that included both the

Gameplay Scale and the Appeal Scale, created using the SurveyMonkey

survey software (http://www.surveymonkey.com).

3.3.3 Procedure

Participants were asked (via the forum posting) to play (the free downloadable

demo of) Q-Games' PixelJunk Eden (http://pixeljunk.jp/), which was viewed as

a reasonably simple yet aesthetic platformer/puzzle game. In PixelJunk Eden, players

play as ‘The Grimp’, manoeuvring around a two-dimensional level (by swinging

around or jumping) attempting to collect pollen that causes plants on the level to grow

and allow other objectives to be met. This game was selected as it should raise

interesting usability and playability issues (due to its novel premise and mechanics)

whilst the demo is freely available to anyone on the PlayStation Network (PSN).

Participants were requested to play the game for up to 2 hours (with a 15 minute break

in the middle) and then complete both of the scales. Participants were informed that

their participation was voluntary, that they were free to leave the study at any time

and that all data gathered was confidential. The study followed BPS ethical

guidelines; ethics committee approval was sought and gained for the study, though no

major ethical issues were foreseen.

3.4 Results and Analysis

3.4.1 The Gameplay Scale

The first stage was to determine how the variables (in this case, each scale

item) in the gameplay scale clustered. A hierarchical cluster analysis was performed

on the 98 responses, using Ward's method. (This method was chosen as it is widely

regarded as providing the most accurate and robust clustering solutions (Scheibler and

Schneider, 1985).) Selecting a clustering solution is as theory-driven as it is data-driven

(Thorndike, 1978), so a number of clustering solutions were outputted and the best

selected on theoretical grounds. In this case, the clustering solution that fitted most

closely with the expected 4-part structure to the scale was selected. This 4-cluster

solution closely mirrored the expected experience/challenge/usability/playability

structure, as can be seen in Appendix B. The first cluster generally comprised

Affect, Sensory Immersion and Fictional Immersion questions; the second mostly

consisted of Absorption and Challenge questions; the third included Navigation,

Goals and Consistency and the fourth included Help, Controls and Menus. A few

anomalies (such as some Controls and Menu items being clustered with the Affect

and Immersion items) did occur however.
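
To illustrate this clustering step, a minimal sketch in Python is given below. It is not the analysis script used for this study (the reported analysis was run in a standard statistics package), and the file and column layout are hypothetical: one row per respondent and one column per scale item, with reverse-scored items already recoded.

    # Minimal sketch: hierarchical clustering of questionnaire items using Ward's method.
    # 'responses.csv' is a hypothetical file: one row per respondent, one column per item.
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster

    responses = pd.read_csv("responses.csv")      # respondents x items
    # To cluster items (rather than respondents), treat each item as a point in
    # "respondent space" by transposing the response matrix.
    item_matrix = responses.T.values

    # Ward's method merges the pair of clusters that minimises the increase in
    # within-cluster variance at each step.
    links = linkage(item_matrix, method="ward")

    # Cut the dendrogram into a chosen number of clusters (four here, selected on
    # theoretical grounds) and list the cluster membership of each item.
    membership = fcluster(links, t=4, criterion="maxclust")
    for item, cluster in zip(responses.columns, membership):
        print(f"{item}: cluster {cluster}")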

The scale as a whole had good reliability in terms of internal consistency, with

a high Cronbach’s alpha of 0.933. However, we have little certainty that it functions

as a unified scale measuring ‘game quality’ or similar, so cluster reliability is more

important. Additionally, we need some way to validate the clustering solution

selected; if the clusters possess good reliability it suggests that we can treat them as

subscales.

Of the 4 clusters, cluster 1 was named ‘Affective Experience’, cluster 2

‘Focus’, cluster 3 ‘Playability Barriers’ and cluster 4 ‘Usability Barriers’ – the reasons

for naming the subscales this way are considered in depth in the discussion section.

Affective Experience had an alpha of 0.933; Focus had an alpha of 0.757; Playability

Barriers’ alpha was 0.857 and Usability Barriers’ was 0.783. In short, all of the

clusters had good reliability, with alphas over 0.7. We thus have good reason to

consider them subscales. The next task was to reduce the scale in size, removing

questions that did not add (or indeed, detracted from) each subscale’s reliability. This

was done in a stepwise process, with the impact on subscale reliability checked each

time an item was removed. This process continued until the impact of removing any

more items would reduce subscale reliability too much, lessen cluster integrity or

leave constructs unaccounted for. The results of this are shown in Table 3.2 below. As

the table illustrates, all the 26 remaining items had a significant correlation to both

scale and subscale total scores.

This revised scale had a Cronbach's alpha of .903, whilst the subscales still had

good reliability: Affective Experience had an alpha of 0.903; Focus had an alpha of

0.711; Playability Barriers’ alpha was 0.814 and usability Barriers’ was 0.760. It

seems reasonable to therefore consider these clusters as subscales measuring specific

constructs.
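
The reliability figures above are Cronbach's alpha values; a minimal sketch of how alpha and the 'alpha if item deleted' diagnostic (which guided the stepwise item removal) can be computed is given below. The file and column names are hypothetical, and this is not the script used for the reported analysis.

    # Minimal sketch: Cronbach's alpha for a subscale, plus "alpha if item deleted".
    # Assumes a data frame with one column per item, reverse-scored items recoded.
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    def alpha_if_deleted(items: pd.DataFrame) -> pd.Series:
        # Recompute alpha with each item removed in turn; items whose removal raises
        # alpha are candidates for deletion.
        return pd.Series({col: cronbach_alpha(items.drop(columns=col))
                          for col in items.columns})

    subscale = pd.read_csv("affective_experience_items.csv")   # hypothetical file
    print("alpha:", round(cronbach_alpha(subscale), 3))
    print(alpha_if_deleted(subscale).sort_values(ascending=False))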

Table 3.2. Revised Gameplay Scale Items and Correlations to Scale and Subscale Scores

 #   Question Item                                                                   Construct     Subscale   Corr. to         Corr. to
                                                                                                               Subscale Total   Scale Total
 1   I enjoyed the game.                                                             Affect        AE         .821(**)         .704(**)
 5   I thought that the game was fun.                                                Affect        AE         .778(**)         .713(**)
21   I found the appearance of the game world to be interesting.                     Sensory       AE         .661(**)         .521(**)
43   The aesthetics of the game were unimpressive. ***                               Sensory       AE         .695(**)         .560(**)
45   The game failed to motivate me to keep playing. ***                             Ownership     AE         .882(**)         .764(**)
47   I wanted to explore the game world.                                             Fictional     AE         .831(**)         .740(**)
 3   I was focused on the game.                                                      Absorption    F          .558(**)         .405(**)
 4   I could identify with the characters.                                           Fictional     F          .505(**)         .411(**)
20   I was unaware of the passage of time whilst playing.                            Absorption    F          .547(**)         .405(**)
23   I forgot about my surroundings whilst playing.                                  Absorption    F          .580(**)         .287(**)
38   I found the game mechanics to be varied enough.                                 Variety       F          .524(**)         .462(**)
41   I thought about things other than the game whilst playing. ***                  Absorption    F          .656(**)         .499(**)
42   My field of view made it difficult to see what was happening in the game. ***   Camera        F          .497(**)         .483(**)
44   I thought the camera angles in the game were appropriate.                       Camera        F          .565(**)         .498(**)
48   I thought the level of difficulty was right for me.                             Challenge     F          .487(**)         .474(**)
15   I always knew where to go in the game.                                          Navigation    PB         .777(**)         .442(**)
27   I knew how the game would respond to my actions.                                Consistency   PB         .669(**)         .592(**)
28   I always knew how to achieve my aim in the game.                                Goals         PB         .781(**)         .567(**)
30   My objectives in the game were unclear. ***                                     Goals         PB         .759(**)         .662(**)
37   I couldn't find my way in the game world. ***                                   Navigation    PB         .703(**)         .543(**)
 8   The game trained me in all of the controls.                                     Help          UB         .579(**)         .414(**)
12   I knew how to use the controller with the game.                                 Controls      UB         .641(**)         .371(**)
14   I found the game's menus to be usable.                                          Menu          UB         .708(**)         .473(**)
16   I knew how to change the settings in the game.                                  Settings      UB         .613(**)         .241(*)
24   I found using the options screen to be difficult. ***                           Settings      UB         .717(**)         .347(**)
36   I found the game's menus to be cumbersome. ***                                  Menu          UB         .650(**)         .503(**)

* Corr. statistically significant to 0.05; ** Corr. statistically significant to 0.01
*** Negative question – scoring reversed.
(AE = Affective Experience; F = Focus; UB = Usability Barriers; PB = Playability Barriers)

3.4.2 The Appeal Scale

As for the Appeal Scale, 94 participants completed it. All scale items had significant

Spearman's correlations with one another (p < 0.01), whilst the Cronbach's alpha was high at

0.939, suggesting good scale reliability. Factor Analysis (which could be performed on this

far smaller scale) found all of the items to fall into one factor with an Eigenvalue of 5.629

that accounted for 70% of the total variance. This allows us to view the Appeal Scale

as measuring one underlying construct – the game's appeal.
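
A minimal sketch of this unidimensionality check is given below; it inspects the eigenvalues of the item correlation matrix (a principal-components style check), with the first eigenvalue divided by the number of items giving the proportion of variance explained. File and column names are hypothetical, and the reported analysis itself was run in a statistics package.

    # Minimal sketch: checking that a short scale measures one dominant factor by
    # examining the eigenvalues of its item correlation matrix.
    import numpy as np
    import pandas as pd

    appeal_items = pd.read_csv("appeal_scale_items.csv")    # hypothetical file, 8 item columns
    corr = appeal_items.corr().values
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first

    print("eigenvalues:", np.round(eigenvalues, 3))
    # With standardised items, total variance equals the number of items, so the first
    # eigenvalue divided by the item count is the proportion of variance explained.
    print("variance explained by the first factor:",
          round(eigenvalues[0] / len(eigenvalues), 3))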

3.4.3 Inter-Scale Correlations

Finally, relationships between the subscales were considered. Spearman’s

Correlations between the Gameplay Scale, the four subscales and the Appeal Scale

revealed that they were all highly correlated with both each other and with the total

score for the scale (See Table 3.3 below). As Table 3.3 illustrates, all of these

correlations were highly significant. This suggests that there may well be an overall

construct of “game quality” to which each of the constructs measured by the relevant

subscale contributes. It also implies that the four subscales measure appeal.

Table 3.3. Spearman's Correlations between Subscales, the Gameplay Scale and the Appeal
Scale for Study 1

                         Appeal   Affective    Focus   Playability   Usability   Gameplay
                                  Experience           Barriers      Barriers    Scale
Appeal Scale               -        .757       .576      .531          .358        .746
Affective Experience     .757        -         .587      .513          .366        .839
Focus                    .576       .587        -        .351          .321        .775
Playability Barriers     .531       .513       .351       -            .480        .737
Usability Barriers       .358       .366       .321      .480           -          .606
Gameplay Scale           .746       .839       .775      .737          .606         -

(All correlations statistically significant to p < 0.01)
(For Appeal Scale correlations, N = 94; for all others, N = 98)

Construct validity was investigated by multiple regression of each of the

Gameplay Scale subscales against the Appeal Scale. This would allow us to

determine which of the subscales best predicted the appeal rating - if the subscales

correlated with the Appeal Scale we have good reason to infer that they are measuring

elements of the game’s appeal and thus that the scale had good construct validity.

This found an R² of 0.731, suggesting that the constructs measured by the four

subscales collectively account for 73% of the variance in a game’s appeal. The overall

effect was significant (F(4,93)=7027.668, p = 0.000); moreover, the Affective

Experience subscale had the highest contribution, with a beta coefficient of 0.599,

followed by Focus (0.209), Playability Barriers (0.164) and Usability Barriers (-

0.003). Only the contribution of the Usability Barriers scale was non-significant

(p > 0.05). All in all, these results imply that each of the subscales measures constructs

that are related to a game’s appeal, but that in this case the Usability Barriers measure

did not make a significant contribution. This entails either that Usability Barriers do

not contribute to a game's initial appeal at all, or that they simply did not influence this

particular game's appeal.
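
A minimal sketch of this regression is given below, using standardised scores so that the coefficients are comparable beta weights. It is illustrative only: column names are hypothetical and the reported figures come from the original analysis, not from this code.

    # Minimal sketch: regressing Appeal Scale scores on the four subscale scores to
    # obtain R-squared and a standardised beta coefficient for each subscale.
    import pandas as pd
    import statsmodels.api as sm

    data = pd.read_csv("study1_scores.csv")        # hypothetical per-participant scores
    predictors = ["affective_experience", "focus",
                  "playability_barriers", "usability_barriers"]
    columns = predictors + ["appeal"]

    z = (data[columns] - data[columns].mean()) / data[columns].std(ddof=1)  # z-scores
    X = sm.add_constant(z[predictors])
    model = sm.OLS(z["appeal"], X).fit()

    print("R-squared:", round(model.rsquared, 3))
    print(model.params)      # standardised beta for each subscale
    print(model.pvalues)     # significance of each contribution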

3.5 Interim Discussion

The aim of this first study was to construct and refine the questionnaire,

ensuring that it and its subscales had good reliability and, by its correlating with the

Appeal Scale, good construct validity.

Overall, this interim stage in the validation of the scale has refined the number

of items in the scale, based on how well the items fit into clusters as revealed by

cluster analysis. However, most scale construction involves using the statistical

method of Exploratory Factor Analysis to determine what underlying variables are

represented by each item in the scale and thus ensure good content validity. This is a

large sample technique, and the 98 responses collected may not be enough for

Exploratory Factor Analysis to be used successfully. It is generally maintained that a

larger sample is needed - a minimum of 250 participants or more (e.g. Guilford, 1954)

or a ratio of 5:1 or 10:1 participants to items or more (e.g. Everitt, 1975) being

recommended. As such, Cluster Analysis is often recommended for scale

development involving smaller samples (e.g. Thorndike, 1978). Whilst the technique

lacks the statistical rigour of factor analysis, it does assign all of an item's variance to

a particular cluster – making it ideal for dividing questions into subscales. Indeed, this

method is what Witmer and Singer (1998) used when developing the presence scale.

If we accept the cluster analysis, the four-cluster solution broadly matched the

expected structure (given the literature review) of Experience, Challenge, Usability

and Playability subscales with a few key differences, whilst the stated aim of using

the analysis to select the most effective questions and reduce the size of the scale was

achieved. The first cluster was very similar to the hypothesised ‘Experience’ factor,

but with a slight shift towards the affective aspects of such experience; hence it is now

called the 'Affective Experience' subscale. One anomaly was the inclusion of

‘Controls’ and ‘Menu’ construct items (Q25 and Q34 respectively in the original

Gameplay Scale) in this cluster. However, both of these items included the word

‘intuitively’, thus it is likely that the use of this word biased responses by adding a

more experiential element to the judgement being requested.

The ‘Challenge’ subscale that was initially hypothesised ended up being less

about challenge and more about cognitive absorption (and included the Camera

items); hence it is now called the 'Focus' subscale. It still covers challenge, but

absorption is taken to be the key measurement here; the level of absorption or focus

influences the player’s ability to engage in a challenge and is itself a result of

acceptable challenges. This subscale covers absorption, challenge and variety, and

produced two interesting results. Items for the 'Fictional Immersion' construct correlated to

both the Affective Experience and Focus subscales (see Appendix A). One’s

experience of the game world (via Q47 in the original Gameplay Scale) was part of

the Affective Experience cluster; one’s ability to empathise with characters (Q4 in the

original scale) was part of the Focus cluster. Previous studies have found empathy to

correlate with absorption (Wickramasekera, 2007); it seems quite plausible that

absorption is required for one to empathise with another, so this relationship is

accepted.

Less understandable is the high correlation of the ‘Camera’ construct items

with the Focus cluster. Whilst it is plausible that one’s perspective of a game and the

ease at which this can be altered could mediate one’s ability to focus on the characters

and challenges, this is still a supposition, as we have no definite evidence of this

relationship. As such, this is precisely the sort of cluster analysis result that we should

be wary of. The subscale will be kept as it is for now, but this must be considered

again after the second part of the scale validation. Indeed, removing these items now

was found to negatively impact the subscale’s reliability, which is why the Focus

scale is (at 9 items) much larger than the other three.

The items in the other two subscales – Playability Barriers and Usability Barriers – were much

as expected and offered few surprises. A plus point for these subscales (along with the

other two) is that all of the expected constructs are still covered by at least one item,

meaning that all of the various elements of game usability and playability can be

measured. We may not expect these various elements to form a consistent scale (e.g.

because consistency may be poor in a game but navigation and goals good) but as

Sauro and Lewis (2009) found, there is a general construct of usability, which

suggests that users form an overall impression of a system’s usability which informs

their judgements for the usability of each element of the system; the same is likely

true for playability. Thus the players likely formed an overall impression of

playability and usability which resulted in the correlation between their judgments

about each element of the game.

It could be argued that this is problematic, as it lessens our trust in users'

judgements: how do we know that the navigation in the game is poor if the user's

overall 'playability' judgement for the game influences their attitude towards this

construct? However, if true, this would also be an issue for many usability evaluation

methods (such as 'think-aloud' user testing, interviews and focus groups) and simply

underlines both the fact that individual Likert items should not be analysed (at least not

formally) and that triangulation using numerous methods is required to identify the

source of poor usability and playability; methods such as user testing and interviews

may be more appropriate here.

The Usability Barriers subscale is so called because it measures the severity of

barriers to player engagement that are rooted in usability issues. This did not

significantly contribute to the game's appeal this time, which suggests that no major

usability issues arose. Such barriers are likely to significantly detract from a game’s

appeal when they are severe, but fade into the background when usability and

playability are good – and thus contributed little to a game’s appeal (though a broader

survey of games would be required to establish this).

Finally, the strong inter-correlations between subscales suggest that an

overall construct of 'game quality' or similar can be measured by the scale; whilst it

would be useful for the scale to function like this, it is too early to claim that such a

measure exists. Either way, the strong relationship between the Gameplay Scale and

the Appeal Scale suggests that the Gameplay Scale is indeed measuring something of

interest, and that it will be worth continuing to validate the scale. It must be noted

(Hassenzahl et al (2000) failed to) that the Appeal Scale is really a measure of initial

appeal. How a user views any system (games included) will change over time

(Grodal, 2000) – something especially true of a large game world that is to be

explored. Thus in the ~2 hours of play that respondents had, they could only form

their initial attitudes towards the game.

These attitudes are still important, however. In an age when most major games

will have a free playable demo available before their release, ensuring that this initial

appeal is high is very important for a game to sell well. Indeed, it is likely this initial

appeal that motivates players to continue playing a game. However, further calibration

of the scale is required before we can state that a game scored well or scored badly on

any of the qualities that it measures.

This study has thus established that the four elements of a video game’s

quality and appeal – Affective Experience, Focus, Usability and Playability – are

relevant when assessing video games; that the Gameplay Scale designed is reliable

and that it does measure a video game’s initial appeal. However, further analysis is

required to determine if these results generalise to different game genres and to

determine if the scale can predict review scores. A further study was therefore

performed to investigate this.

4 Study #2: Further Exploration of Scale Validity

4.1 Rationale

The previous study established the constituent elements of the Gameplay

Scale, reduced it to 26 items and suggested that it has good construct validity. To

further refine the gameplay scale and ensure that the revised version had good

construct validity a further study was undertaken. This study aims to determine if the

findings of the previous study can be generalised to other game genres, and to

determine if the Gameplay Scale correlates with review scores, which would

demonstrate the Gameplay Scale’s usefulness to industry. To do this, a player testing

study in which participants would play the game in a laboratory before completing the

Gameplay and Appeal Scales was performed. This would determine if the Gameplay

Scale would still correlate to initial appeal for open-world games.

The ‘open-world’ or ‘sandbox’ genre involves providing players with a large

environment and allowing them to choose which tasks they perform in a highly non-

linear way. The open-world genre was selected for two reasons – to ensure that the

Gameplay Scale had good generalisability and could be used to determine the appeal

of games in many genres (since these non-linear games are very different to the linear

PixelJunk Eden used in study 1) and because there should be interesting playability

and usability issues arising in this genre, especially regarding navigation, camera and

controls as path-finding in open-worlds can be difficult. Controlling the game genre in

this way lessens the ecological validity of the study and our ability to generalise these

findings to all genres of game; however, it does increase internal validity by

controlling for each player’s genre preference and allows for more reliable between-

group comparisons.

The two games, Radical Entertainment's Prototype

(http://www.prototypegame.com/) and Sega’s The Incredible Hulk

(http://incrediblehulkthegame.com/) - see Figure 4.1 - were selected as they were

both very similar in terms of genre (open-world action games involving a super-

powered protagonist, with a linear first level) and setting (modern-day New York

City) but had been reviewed very differently on the review compilation site Metacritic

– with aggregate ‘metascores’ of 79 and 55 (out of 100) respectively - see

http://www.metacritic.com. Prototype was thus fairly well reviewed (though by no

means perfect) whilst Hulk was quite poorly reviewed. (Although 55% would usually

seem like an average score, video game review scores are usually shifted higher, with

70% or so being an average review score; see

http://www.joystiq.com/2006/08/07/ign-gamespot-review-score-inflation-revealed/

for more).

Figure 4.1. Screenshots from the games The Incredible Hulk (left) and Prototype (right).
Both images from www.ign.com.

Therefore, the further aim of this study was to determine if the Gameplay

Scale would give significantly higher ratings to games given higher metascores than

games given lower metascores. It is debatable whether review scores are a perfect

metric to which the scale can be compared (especially for initial appeal), but given

that the difference between the metascores (compiled from 51 (for Prototype) or 26

(for Hulk) magazine and website reviews) is significant (p < 0.01) we should

nevertheless expect our scale to detect this apparent difference in game quality if it

measures factors that influence a game’s perceived quality.

The experimental hypotheses were that a well reviewed game would be rated

significantly more highly on all or at least some of the Gameplay subscales and the

(initial) Appeal Scale than the non-well reviewed game and that all (or at least some)

of the Gameplay subscales would correlate with the (initial) Appeal Scale.

4.2 Methods

4.2.2 Participants

Seventeen subjects participated in the experiment (M: 23.6 years old, SD: 2.0;

4 females). The participants were recruited in an opportunistic sample around the

university campus and paid £6. The participants were recruited using the following

criteria to ensure that they were in the target demographic for the games being tested:

• They were 18-27 years old

• They had not played the test game

• They regularly played games (a median of 5-10 hours per week)

• They had played open-world games for at least 6 hours; most had played the

genre for 30+ hours

• They owned an average of 2 gaming platforms (including PC).

These participants were then randomly assigned to one of the two groups (n = 9

for the game Prototype; n = 8 for the game Hulk); of the four females, two were

assigned to each game to counter-balance the genders.

Figure 4.2. The DualShock 3 Controller. Like most


modern controllers, this has vibrating tactile feedback or
‘rumble’. From www.gizmodo.com

4.2.3 Materials

A PlayStation 3 console was connected to an LCD colour projector in a

laboratory. One of two games was projected onto a wall in the lab. Players sat in front

of the wall and played the game using a standard DualShock 3 controller (see Figure

4.2 above). Lights in the lab were dimmed to enhance player concentration on the

game. A digital video camera (filming in night mode) was positioned to the front of

the player to capture their reactions to the game, whilst a researcher sat to the rear of

the lab in order to take notes on their activities in the game world. Printed copies of

the revised Gameplay Scale (see Appendix C) and Appeal Scales (see Appendix D)

were used in this study.

4.2.1 Design

The experiment had a one-way between-subjects design – one group played

Hulk and the other group played Prototype. The independent variable was the game

that each group played (and the review score given to that game); the dependent

variables were the subjective ratings (on both the Gameplay and Appeal

Scales) that participants gave to the games, which should give us a measure of each

game's initial appeal.

4.2.4 Procedure

Participants played one of the games for one hour (including any non-playable

cut scenes) in the laboratory whilst being videoed. Participants then completed the

Gameplay and Appeal Scales by hand before being given a short, semi-structured

interview. It is maintained that the 1 hour play session gives enough time to form

attitudes about most of the game’s aspects. The video and interview data were

gathered to resolve any intractable issues that may arise from the scale data; as such,

they were not the focus of the investigation. The study followed BPS ethical

guidelines; ethics committee approval was sought and gained for the study, though no

major ethical issues were foreseen.

4.3 Results

4.3.1 Significance Testing Between Groups

The overall scale scores and mean subscale scores were first calculated for

each participant; these are summarised in Table 4.1 below. As Table 4.1 shows,

Prototype was given higher ratings than Hulk on every single scale. Shapiro-Wilk

tests found all groups of data to be normally distributed (all p > 0.05) with the

exception of the Hulk Appeal Scale data (p = 0.02), whilst all of the data sets passed

Levene's test for equality of variance (all p > 0.05) with the exception of the

Playability Barriers (PB) data (p = 0.012). As parametric tests are normally robust to

such minor violations of their assumptions, a MANOVA test was performed on the

data.

Table 4.1. For Each Game, the Mean Score for Each Subscale and the Mean Scale
Scores, with Standard Deviation

Scale                              Prototype             Hulk
                                   M         SD          M         SD
Affective Experience Subscale      34.111    4.935       29.750    4.920
Focus Subscale                     44.444    5.174       41.125    5.436
Playability Barriers Subscale      24.555    6.267       24.250    3.105
Usability Barriers Subscale        31.444    4.034       25.375    3.204
Summed Gameplay Scale              134.555   11.673      120.500   9.242
Appeal Scale                       42.777    7.446       38.750    5.849

However, including the summed Gameplay Scale data in the analysis resulted

in singularity, preventing Box's M (a test of the equality of covariance matrices) from being calculated.

This is a violation of one of the key assumptions of MANOVA, so the summed

Gameplay Scale data (seen in Figure 4.3 below, which shows that Prototype was

given higher scores than Hulk) were thus analysed in a separate one-tailed

independent samples t-test. A Bonferroni correction was applied to the significance

threshold to account for the additional test (the corrected threshold is the standard 0.05

criterion divided by n, the number of analyses being performed) and prevent the familywise

error rate from increasing. Including all of the analyses to be performed, n = 6, entailing a

corrected threshold of p < 0.008. The t-test revealed that for the summed Gameplay Scale, Prototype was scored

significantly higher than Hulk (t (15) = 2.73, p = 0.000) by participants, as

hypothesised. The ƞ² = 0.33, a large effect size that suggests 33% of the total variance

in the Gameplay Scale scores is a result of varying the game (generally, .01 is a small effect

size, .06 and above a moderate effect size, and .14 and above a large effect size (Cohen,

1988, cited in Pallant, 2001)).
1988, cited in Pallant, 2001).

Figure 4.3. Mean summed Gameplay Scale score and SD for


each game. (‘1’ = Prototype; ‘2’= Hulk)

The MANOVA analysis of the average subscale scores, sans the summed

Gameplay Scale data, could then be performed as Box’s M = 12.302, p = 0.924,

suggesting that the data had homogeneity of covariance matrices - a key assumption

for MANOVA analysis. Only the Usability Barriers subscale showed a significant

difference between games (F (1, 15) = 11.580, p = 0.004); for all other measures p >

0.008. The partial ƞ² = 0.44 for the Usability Barriers subscale, meaning that 44% of

the subscale variance was accounted for by the game manipulation. Interestingly, for

the Affective Experience subscale the partial ƞ² = 0.18, for the Appeal Scale the

partial ƞ² = 0.09, whilst for the Focus subscale the partial ƞ² = 0.1 – a large and two

moderate effect sizes respectively, despite these differences being non-significant.
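
A minimal sketch of this analysis is given below, using the MANOVA implementation in the statsmodels library followed by per-subscale univariate tests. Column names are hypothetical; note that Box's M is not provided by statsmodels and was obtained from the statistics package used for the reported analysis.

    # Minimal sketch: MANOVA on the four subscale means with the game as the factor,
    # followed by univariate one-way ANOVAs for each subscale.
    import pandas as pd
    from scipy import stats
    from statsmodels.multivariate.manova import MANOVA

    scores = pd.read_csv("study2_scores.csv")      # hypothetical per-participant subscale means
    manova = MANOVA.from_formula(
        "affective_experience + focus + playability_barriers + usability_barriers ~ game",
        data=scores)
    print(manova.mv_test())                        # Wilks' lambda, Pillai's trace, etc.

    subscales = ["affective_experience", "focus",
                 "playability_barriers", "usability_barriers"]
    for subscale in subscales:
        groups = [group[subscale].values for _, group in scores.groupby("game")]
        f, p = stats.f_oneway(*groups)
        print(f"{subscale}: F = {f:.2f}, p = {p:.3f}")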

4.3.2 Inter-Scale Correlations

The other important question involved determining which (if any) scales

correlated with the Appeal scale. The inter-scale correlations for the scale sums are

shown in Table 4.2 below. Note that both groups were added together to ensure a

large enough sample for correlational procedures; any correlations that exist should

hold for both games given their high level of similarity.

Table 4.2. Spearman's Correlations between Subscales, the Gameplay Scale and the Appeal
Scale for Study 2. For each correlation value, its statistical significance is also reported.

                         Affective    Appeal    Focus     Playability   Usability   Gameplay
                         Experience   Scale               Barriers      Barriers    Scale
Affective Experience        -         0.779**   0.711**    0.034         0.129      0.735**
Appeal Scale             0.779**        -       0.737**    0.095         0.143      0.708**
Focus                    0.711**      0.737**     -        0.125        -0.061      0.760**
Playability Barriers     0.034        0.095     0.125        -           0.082      0.521*
Usability Barriers       0.129        0.143    -0.061      0.082           -        0.394
Gameplay Scale           0.735**      0.708**   0.760**    0.521*        0.394        -

** Correlation is significant at the 0.01 level (1-tailed).
* Correlation is significant at the 0.05 level (1-tailed).

As Table 4.2 and Figure 4.4 (below) illustrate, significant correlations exist

between the Affective Experience, Focus, Appeal and overall Gameplay Scales (p <

0.01); the only other significant relationship is between the Playability Barriers and

Gameplay Scales (p = 0.015). To further ensure construct validity, simple linear

regression was performed, this time only between the Appeal and summed Gameplay

Scales with both groups again added together as linear regression requires a minimum

of at least 15 participants per dependent variable to be accurate (Stevens, 2002). This

found an R² of 0.577, suggesting that 58% of the variance in the Appeal rating given

by participants of both groups can be considered due to factors measured by the

Gameplay Scale, the subsequent ANOVA finding this relationship to be significant

(F (1,15) = 22.786, p = 0.000).

[Scatterplot: summed Gameplay Scale score (x-axis, 100-160) against Appeal Scale score (y-axis, 20-55), with a linear line of best fit.]

Figure 4.4. Correlation between Gameplay and Appeal Scale scores, with line of best
fit.

4.3.3 Comparison with Study 1 Data

Finally, the data gathered in study 2 was compared to the data gathered in

study 1. This involved performing a one-way ANOVA between the Prototype, Hulk

and (study 1 game) PixelJunk Eden Gameplay Scale mean scores (seen below in

Figure 4.5), followed by two planned comparisons: PixelJunk Eden vs. Prototype and

PixelJunk Eden vs. Hulk. PixelJunk Eden was given a metascore of 80; we should

expect its Gameplay Scale score to be significantly different to Hulk but not to

Prototype. Five outliers (as identified by the SPSS boxplot) with Gameplay Scale

scores of 80 or below were removed from the PixelJunk Eden data before analysis.

The Levene’s statistic was narrowly significant (p = 0.046) meaning that the planned

comparison statistic in which equal variances are not assumed was used. The ANOVA

found no significant difference between the scores for the games (F (2,107) =

1.81867, p = 0.16); however, the planned comparisons show the PixelJunk Eden vs.

Prototype score difference to be non-significant (F (1, 11) = 0.5329, p = 0.09) and the

PixelJunk Eden vs. Hulk comparison to be significant (F (1, 11) = 7.29, p = 0.02).
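
A minimal sketch of this comparison is given below. It runs the omnibus one-way ANOVA and then, as a stand-in for the planned contrasts reported above, pairwise Welch t-tests (which do not assume equal variances). Data layout and column names are hypothetical.

    # Minimal sketch: one-way ANOVA across the three games, followed by the two planned
    # comparisons approximated here as Welch t-tests (equal variances not assumed).
    import pandas as pd
    from scipy import stats

    scores = pd.read_csv("all_gameplay_totals.csv")    # hypothetical: columns 'game', 'gameplay_total'
    by_game = {game: d["gameplay_total"] for game, d in scores.groupby("game")}

    f, p = stats.f_oneway(by_game["PixelJunk Eden"], by_game["Prototype"], by_game["Hulk"])
    print(f"Omnibus ANOVA: F = {f:.2f}, p = {p:.3f}")

    for other in ["Prototype", "Hulk"]:                # planned comparisons vs. PixelJunk Eden
        t, p = stats.ttest_ind(by_game["PixelJunk Eden"], by_game[other], equal_var=False)
        print(f"PixelJunk Eden vs. {other}: t = {t:.2f}, p = {p:.3f}")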

[Bar chart: mean Gameplay Scale score (y-axis, 80-150) for each game – Prototype, PixelJunk Eden and Hulk (x-axis).]

Figure 4.5. Mean Gameplay Scale scores for each game. Error bars show standard error.

4.4 Interim Discussion

The second study aimed to examine the generalisability of the Gameplay Scale

and the results from study 1 by having participants play open-world action games. It

also aimed to determine if the Gameplay Scale could distinguish between games

based on their review scores. The study found that, using the Gameplay Scale, players

rated Prototype significantly higher than The Incredible Hulk, allowing us to reject the

null hypothesis. As Prototype was significantly better reviewed than Hulk by the

gaming press, this suggests that the scale has good construct validity. This is further

confirmed by the scale correctly ranking the PixelJunk Eden data. As PixelJunk Eden

received almost the same metascore as Prototype, the fact that the scale found no significant

difference between those two games but did find a significant difference between

PixelJunk Eden and Hulk further suggests that the scale is measuring game quality in

some way – thus making it useful for player testing. Furthermore, the Gameplay Scale

correlated with the Appeal Scale, suggesting that the scale does measure the initial

appeal of a game; it accounted for 58% of the variance in player appeal ratings which

gives the scale good construct validity. Finally, the large overall Gameplay Scale

effect size showed that 33% of the variance in the scores came from varying the game

played (and not just individual differences or error), again demonstrating that the scale

can detect differences in game appeal and quality.

Moving to the specific subscales, the picture provided by the analysis is less

clear. The construct validity of the scale is undermined both by there being no significant

difference between Prototype's and Hulk's Appeal Scale scores and by the fact that the

Affective Experience and Focus subscales – those correlating most strongly with the Appeal

Scale – also showed no significant difference between each game's scores.

Moreover, the only subscale with a significant difference between each game’s scores

was the Usability Barriers subscale; a large effect was found to exist, with 44% of the

variance being explained by the manipulation. However, reviews (e.g. Goldstein, 09

June 2008; McGarvey, 17 June 2008) generally focused on the high level of repetition

in Hulk as its major failing – a Focus and not a Usability Barriers issue, according to

the scale at least.

How then do we account for these apparently anomalous results? The first two

are most likely due to the small sample size; despite its flaws, Hulk still provided an

enjoyable experience for the hour that participants played it (no participant stated that

they didn't enjoy playing it), and thus the difference in initial appeal is probably

rather small. The medium effect size for the Appeal Scale (ƞ² = 0.09) suggests that

with even a few more players, a significant difference is likely to be found.

As for the Affective Experience and Focus subscales, the same is probably

true. With effect sizes of ƞ² = 0.18 and ƞ² = 0.1 respectively, it is clear that

manipulating the game has a considerable influence on these scores, even if the

difference is not significant. With a larger sample (n = 15+) it is hypothesised that a

significant difference will be found.

Given the above considerations, why should the Usability Barriers scale show

a significant difference? It is likely that during initial play experiences, usability

barriers are especially pronounced; players have to grapple with new control systems,

menus and ranges of settings with which they are not familiar, so a larger effect size is

perhaps to be expected. From observing the play sessions, these indeed appeared far

worse for the Hulk, with many players having a lot of (initial) trouble with the

healing, climbing and targeting controls, as well with the menu screens. That isn’t to

say that there were no issues with Prototype; the ‘disguise’ mechanic in this game was

very poorly explained and thus resulted in a lot of player confusion.

However, these usability problems did not correlate with the Appeal Scale,

suggesting that Usability Barriers factors do not determine a game’s initial appeal (the

same was found in the study 1 regression analysis – see section 3.4.3). Referring to

the interview transcripts, most players did not view the game as having poor usability

even if they struggled with a game mechanic unnecessarily – e.g. from the interviews:

“Researcher: Do you think the game could have made that [the picking up

objects mechanic] clearer?

Hulk Participant 8: Maybe. But maybe I could have concentrated more”

“Hulk Participant 4 [On having difficulty with the jumping controls] …I

think it’s a good thing, because I’ve never played a game where you jump

like that before, so it obviously takes a bit of getting used to…”

This implies that ‘core gamers’ (i.e. those that play regularly) expect some

degree of struggle when it comes to learning a new game's controls - unless the

usability problems are very severe, it is only when these problems persist through

long-term play experiences that they inhibit play. The opposite is likely true for casual

gamers (i.e. those that do not play very often).

For the same reason, the reviews did not mention these usability issues, largely

because the longer play session had resulted in the reviewers transcending these initial

usability problems, and instead encountering Focus issues (such as a lack of variety)

that were not present in the initial play session. The interesting result that still requires

explanation, then, is how the Gameplay Scale was able to arrange the games in order

of their metascores despite the most significant difference being in a subscale that

determined neither the review scores nor the game’s initial appeal to players (indeed,

the Usability Barriers subscale didn’t actually correlate with the scale as a whole for

study 2). The first thing to note is that, despite having non-significant differences with

Hulk, both PixelJunk Eden and Prototype scored higher than Hulk for almost all

subscales; this would have contributed to the overall significant difference, meaning

that these factors were still important in ranking the games. Second, the poor

Usability Barriers ratings for Hulk likely counteracted the Affective Experience and

Focus scores that the game received, which were probably boosted by the short play

session that prevented much repetition from occurring. Thirdly, it is possible that

games that suffer from such initial usability problems are more likely to be poorly

designed in a way that can also inhibit the long-term appeal of the game and thus

review scores.

Finally, the Playability Barriers subscale showed no indication of either a

meaningful effect size or of a significant difference between games. Whilst a larger

sample could again remedy this, another cure would be to broaden its scope and

include some of the playability issues that were excluded from the initial scale (such

as rewards, feedback, etc) for the sake of brevity; this would make the instrument

more sensitive to differences in playability between games.

Despite some anomalous results with the individual subscales, study 2 has

shown the Gameplay Scale as a whole to have good construct validity by measuring a

video game’s initial appeal and good generalisability by measuring differing genres.

In addition, both construct validity and the utility of the Gameplay Scale to industry

has been demonstrated by the scale ranking games according to their review scores.

The final task is to discuss the wider implications of these findings.

5 General Discussion

Through study 1 an instrument was developed to measure two major aspects

of player experience, as well as the usability and playability factors of the game,

revealing that all except the usability factors contributed to a game’s initial appeal.

Study 2 has shown that the Gameplay Scale instrument should have predictive power

when it comes to anticipating the quality and appeal of video games across different

genres, although reference to the individual subscales reveals that it may not have

behaved in this way for the expected reasons. In short, the scale developed thus far

should be considered a proof of principle; that, as Hassenzahl et al (2000) found for

productivity software, both hedonic and ergonomic factors influence a video game’s

appeal and that it is appropriate to measure these factors in the manner described by

this study. Whilst the ergonomic usability and playability factors generally were less

important in determining initial appeal, there are good reasons to believe that this was

only the case for initial appeal, and that longer gaming sessions increase the importance

of these factors.

It perhaps goes against the heuristics examined (in Febretti and Garzotto,

2009; Desurvire and Wiberg, 2009; Pinelle, Wong and Stach, 2008; Federoff, 2002

and Korhonen and Koivisto, 2006) to find that usability and playability factors were

less important in determining a game’s initial appeal than expected. However, these

factors were still important to some degree, whilst it is maintained that longer testing

sessions will highlight their importance. In terms of the player experience factors,

considering challenge as associated with cognitive absorption instead of immersion

(contra Ermi and Mäyrä, 2005) seems to have been the correct choice given the

clustering, even if cognitive absorption (included as per Jennett et al, 2008) was the

dominant factor in the Focus subscale. Nevertheless, the Focus subscale should act as

a measure of flow that can also measure suboptimal experiences, improving on how

many existing scales (e.g. IJsselsteijn et al, in preparation) chose to measure flow.

The Focus subscale being considerably longer (9 items as opposed to 5 or 6

for the other subscales) was an artifact of the process used to whittle the scale down

from 49 questions, but has important implications. If we accept the success of the

entire scale in ranking the games according to review score as a reason to accept the

Focus subscale’s larger size, it suggests that the Focus factors are of greater

importance in determining a game’s reviews. Whilst Affective Experience factors

generally correlated more strongly with initial appeal, the Hulk reviews (as an

anecdotal example) did note Focus issues as determining the game’s quality and

appeal, so long term play might value Focus factors more highly. It is interesting that

none of the models in the previous literature (save perhaps for Chen, 2007) noted

Focus constructs as being more important than other player experience factors. It is

possible that not all constituents of the gameplay experience are equally important,

but it is likely that this varies from game to game and genre to genre.

Finally, the Affective Experience scale measures especially important aspects

of the game experience (judging by its correlation to the Appeal Scale) and so the

models of Ermi and Mäyrä (2005), Brown and Cairns (2004) and Malone (1981) were

highly successful in capturing this rich phenomenology through the Gameplay Scale.

There are also other, quite different approaches to the video game experience

that were not covered in the literature review. Whilst the current study summarised

much of the HCI literature, Ryan et al (2006) and Rigby and Ryan (2007) examined

player experience factors in a very different way. Their Player Experience of Need

Satisfaction (PENS) model was founded upon motivational psychology, and included

factors such as the need for player feelings of autonomy and competence in addition

to more familiar constructs such as presence and intuitive controls. This model was

also found to discriminate between games with high and low metascores (although it

could be argued that almost any model could distinguish between a game rated 56.6%

and a game rated 97.8%, as Ryan et al (2006) were able to) using a scale, and so

future work should also consider including such motivational factors.

As noted though, this should be considered an exploratory study. The sample

size was small (n = 98 for study 1, n = 17 for study 2) preventing a full factor analysis

from being performed in study 1 and limiting the strength of the conclusions in study

2. Indeed, the sample size of study 2 fell below Tullis and Stetson’s (2004)

recommended minimum of 10 per group when using such scales. Moreover, further

analysis is required to establish what a high score on the scale is and what is a low

score – and thus establish benchmarking. Not only that, but some of the clustering of

the questions is perhaps suspect (especially the empathy and camera questions being

in the Focus subscale, which was quite unexpected), whilst – against the received

wisdom on scale design – moving the general questions to the end may encourage

players to base their judgements on all of the elements referred to in the preceding

specific questions.

The generalisability from these results can also be questioned. Only three

games were studied, whereas previous scales were built upon the experiences of

players across many games (e.g. Jennett et al, 2008). Yet throughout the two studies,

such generalisability was sacrificed to increase the internal validity of the results. By

only having players (recruited from a console manufacturer’s official forum, a

reasonably reliable source for a web survey) play one game in study 1 the error

introduced by recruiting players from a score of third-party websites was reduced.

Moreover, controlling the sort of games (and sort of player) in study 2 allowed for

direct comparison between two games even if each player only played one. Of course,

repeated measures would have been desirable here, but ensuring that players played

the game for long enough (i.e. at least an hour; even longer would have been

preferable) prevented this – and resources were not available to study longer sessions

involving multiple games or use greater triangulation.

The choice of games is then perhaps suspect – titles that players could have rated

more fully within an hour may have been a better choice, although the

remarkable similarity between the two titles did help to ensure a reasonable level of

control. Finally, it is recognised that metascores, and the opinions of reviewers, are

not an ideal external measure of game quality, especially given repeated allegations of

games publishers tampering with review scores (e.g. Plunkett, 10 July 2009).

Measures of player arousal - such as galvanic skin response or facial

electromyography (Nacke and Lindley, 2008) - may have been preferable.

Nevertheless, metascores are held to be highly important in the industry, with

anecdotal evidence suggesting that many publishers can correlate the metascores of

games to the sales of said games (Stuart, 17 January 2008). This is why metascores

were chosen – by being able to predict metascores when assessing in-development

games, the Gameplay Scale may predict the sales of a game and suggest areas for

improvement that should increase the appeal of the game and thus its sales. By doing

this, the Gameplay Scale (or at least, the rationale behind it) is potentially very useful

to the video game industry.

In addition to overcoming the weaknesses already mentioned, future studies

will need to expand such scales to cover multiple genres; even the core constructs that

this scale represents will not cover every game genre; e.g. Massively Multiplayer Online

Role-Playing Games (MMORPGs) have social factors that it is very important to

measure. One solution to this is to add 'modules' to the scale: extra (validated)

subscales that cover elements that are important to each genre. For instance, we might

add a ‘social factors’ subscale when assessing MMORPGs. We also need to study

more types of players, the current study assessing only dedicated or ‘core’ gamers; if

we want to cover more casual gamers, we might add an ‘approachability subscale’, as

per Desurvire and Wiberg (2008). In addition, whilst the androcentric gender bias

mirrors that of core gamers, more female participants will be required to study casual

gamers.

Nevertheless, despite the limitations noted the study has achieved its ultimate

aim – to establish that player experience, usability and playability factors are all

important in player perception of video game quality, and then to create a scale to

measure these characteristics during player testing. Moreover, by drawing on previous

questionnaires and the questionnaire design literature, the issues documented by

Hornbaek (2006) were largely avoided by increasing the rigour involved in the scale’s

design.

The Gameplay Scale, and the reasoning behind it, therefore has utility to the

games industry due to its predictive strength. Whilst the review scores of the games

tested were already known, there is no reason (once benchmarking is established) that

the Gameplay Scale couldn’t be used to predict the likely review score of a game.

Combined with the usual uses of such scales in player testing (i.e. to add attitudinal

data or locate important areas in a dataset), a scale that measures both hedonic and

ergonomic factors is very useful indeed.

6 Conclusions

Many studies on video game experience end with a complex diagram, showing

the interrelations between the analysed constructs. This one does not, as it is

recognised that player experiences are so varied that such formulations of how

‘experience x leads to experience y or is a species of experience z’ fail to capture this

variety. Instead, the key message from this study is that the appeal of video games

does have many components, but that the relations between these constructs can vary

depending on the game. What is more certain, however, is that whatever the enjoyable

experience of a particular game entails this enjoyment cannot be fully realised if

barriers are in place that prevent the user from engaging in play. Some of these

barriers are deeply embedded in the experience (such as the quality of the characters

or fundamental game mechanics) and may be difficult to improve; others, however, are

not.

These are the playability and usability factors measured by this study, which

were found to be important predictors of review scores and perceived appeal. Those

involved in player testing are thus recommended to measure both hedonic and

ergonomic factors. Whilst the current study focused only on initial appeal, this is still

of importance in player testing. Both Pagulayan et al (2003) and Fulton (2002) note

the need to optimise the player experience from 1 minute to 10 minutes to 1 hour to

10 hours, whilst the proliferation of easily downloadable demos means that initial

gameplay experiences are much more closely tied to purchase choices than before.

This thesis has thus shown that player experience, usability and playability all

contribute to a game’s initial appeal and devised a scale to measure all three of these

factors, neither of which had been done before. Moreover, scores derived using the

scale correlate to the review scores of games, suggesting that the scale may have

practical applications in industry player testing. Although the scale designed in this

study may only make a modest contribution to player testing protocols, this could

nevertheless be a useful contribution, allowing us to quantify the degree to which the

game presents barriers to player enjoyment. Again, players choose to play games, and

unless we remove or reduce such barriers they can always put the controller down and

do something else.

References

Abran, A., Khelifi, A., Suryn, W., and Seffah, A. (2003). Usability meanings and
interpretations in ISO standards. Software Quality Journal, 11(4):325–338.

Agarwal, R. and Karahanna, E. (2000). Time flies when you’re having fun: Cognitive
absorption and beliefs about information technology usage. MIS Quarterly,
24(4):665–694.

Alwin, D. F. and Krosnick, J. A. (1991). The reliability of survey attitude


measurement: The influence of question and respondent attributes.
Sociological Methods Research, 20(1):139–181.

Arsenault, D. (2005). Dark waters: Spotlight on immersion. In Game On North


America 2005 Conference Proceedings, pages 50–52.

Bangor, A., Kortum, P. T., and Miller, J. T. (2008). An empirical evaluation of the
system usability scale. International Journal of Human-Computer Interaction,
24(6):574–594.

Bergstrom, B. A. and Lunz, M. E. (1998). Rating scale analysis: Gauging the impact
of positively and negatively worded items. In Annual Meeting of the American
Educational Research Association, April 13–17, 1998, San Diego, CA.

Brooke, J. (1996). SUS: A quick and dirty usability scale. In Jordan, P. W., Thomas,
B., Weerdmeester, B. A., and McClelland, I. L., editors, Usability Evaluation
in Industry. Taylor & Francis., London.

Brown, E. and Cairns, P. (2004). A grounded investigation of game immersion. In


CHI ’04: CHI ’04 Extended Abstracts On Human Factors In Computing
Systems, pages 1297–1300, New York, NY, USA. ACM Press.

Cairns, P. and Cox, A. L. (2008). Research Methods for Human-Computer


Interaction. Cambridge University Press, New York, NY, USA.

Carifio, J. and Perla, R. J. (2007). Ten common misunderstandings, misconceptions,


persistent myths and urban legends about Likert scales and Likert response
formats and their antidotes. Journal of Social Sciences, 3(3):106–116.

Carifio, J. and Perla, R. (2008). Resolving the 50-year debate around using and
misusing Likert scales. Medical Education, 42(12):1150–1152.

Chen, J. (2007). Flow in games (and everything else). Commun. ACM, 50(4):31–34.

Chen, M. and Johnson, S. (2004). Measuring flow in a computer game simulating a


foreign language environment. Retrieved 17 June 2009 from
http://markdangerchen.net/papers/

Chin, J. P., Diehl, V. A., and Norman, K. L. (1988). Development of an instrument
measuring user satisfaction of the human-computer interface. In CHI '88:
Proceedings of the SIGCHI Conference On Human Factors In Computing
Systems, pages 213-218, New York, NY, USA. ACM Press.

Colman, A. M., Norris, C. E., and Preston, C. C. (1997). Comparing rating scales of
different lengths: Equivalence of scores from 5-point and 7-point scales.
Psychological Reports, (80):355–362.

Cowley, B., Charles, D., Black, M., and Hickey, R. (2008). Toward an understanding
of flow in video games. Comput. Entertain., 6(2):1–27.

Csikszentmihalyi, M. (1975). Flow: The Psychology of Optimal Experience. Harper
Perennial, New York.

Csikszentmihalyi, M. (2000). Beyond Boredom and Anxiety: Experiencing Flow in


Work and Play. Jossey-Bass; New York.

Dawes, J. (2001). Comparing data gathered using 5 point versus 11 point scales. In
Australian & New Zealand Marketing Academy Conference. Massey
University.

Desurvire, H., Caplan, M., and Toth, J. A. (2004). Using heuristics to evaluate the
playability of games. In CHI ’04: CHI ’04 extended abstracts on Human
factors in computing systems, pages 1509–1512, New York, NY, USA. ACM
Press.

Desurvire, H. and Wiberg, C. (2008). Master of the game: assessing approachability


in future game design. In CHI ’08: CHI ’08 Extended Abstracts On Human
Factors In Computing Systems, pages 3177–3182, New York, NY, USA.
ACM.

Desurvire, H. and Wiberg, C. (2009). Game usability heuristics (PLAY) for evaluating
and designing better games: The next iteration. Lecture Notes in Computer
Science, 5621:557–566.

Dillman, D. A., Tortora, R. D., and Bowker, D. (1998). Principles for Constructing
Web Surveys. Technical report, SESRC, Washington.

Dillon, A. (2001). Beyond usability: process, outcome and affect in human computer
interactions. Canadian Journal of Information Science, 26(4):57-69.

Douglas, Y. and Hargadon, A. (2000). The pleasure principle: immersion,


engagement, flow. In HYPERTEXT ’00: Proceedings of the Eleventh ACM on
Hypertext and Hypermedia, pages 153–160. ACM Press.

Eastin, M. S. and Griffiths, R. P. (2006). Beyond the shooter game: Examining


presence and hostile outcomes among male game players. Communication
Research, 33(6):448–466.

Ermi, L. and Mäyrä, F. (2005). Fundamental components of the gameplay experience.
In de Castell, S. and Jenson, J., editors, Changing Views: Worlds in Play.
Selected papers of the 2005 Digital Games Research Association’s (DiGRA)
Second International Conference, pages 15–27.

Everitt, B. S. (1975). Multivariate analysis: the need for data, and other problems. The
British Journal of Psychiatry, 126(3):237–240.

Febretti, A. and Garzotto, F. (2009). Usability, playability, and long-term engagement


in computer games. In CHI EA ’09: Proceedings Of The 27th International
Conference Extended Abstracts On Human Factors In Computing Systems,
pages 4063–4068, New York, NY, USA. ACM.

Federoff, M. A. (2002). Heuristics And Usability Guidelines For The Creation And
Evaluation Of Fun In Video Games. Master’s thesis, Indiana University

Foddy, W. (1993). Constructing Questions for Interviews and Questionnaires: Theory


and Practice in Social Research. Cambridge University Press, Cambridge.

Fowler, F. J. (2001). Survey Research Methods (Applied Social Research Methods).


Sage Publications, Inc.

Fulton, B. (2002). Beyond psychological theory: getting data that improves games, in
Proceedings of the Game Developer Conference. (San Jose, CA, March 2002).

Ganassali, S. (2008). The influence of the design of web survey questionnaires on the
quality of responses. Survey Research Methods, 2(1):21–32.

Garland, R. (1991). The mid-point on a rating scale: Is it desirable? Marketing


Bulletin, (2):66–70.

Gilleade, K. M. and Dix, A. (2004). Using frustration in the design of adaptive


videogames. In ACE ’04: Proceedings of the 2004 ACM SIGCHI International
Conference on Advances In Computer Entertainment Technology, pages 228–
232, New York, NY, USA. ACM.

Gilljam, M. and Granberg, D. (1993). Should we take don’t know for an answer?
Public Opin Q, 57(3):348–357.

Goldstein, H. (09 June 2008). The Incredible Hulk review. In IGN. Retrieved 08
August 2009 from http://uk.ps3.ign.com/articles/880/880381p1.html

Gray, W. D. and Salzman, M. C. (1998). Damaged merchandise? A review of


experiments that compare usability evaluation methods. Human-Computer
Interaction, 13(3):203–261.

Grodal, T. (2000). Video games and the pleasures of control. In D. Zillmann and
P. Vorderer (eds) Media Entertainment, pp. 197–212. Mahwah, NJ: Erlbaum.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York.

Hartley, J. and Betts, L. R. (2009). Four layouts and a finding: the effects of changes
in the order of the verbal labels and numerical values on Likert-type scales.
International Journal of Social Research Methodology, 99(1):1–11.

Hassenzahl, M., Platz, A., Burmester, M., and Lehner, K. (2000). Hedonic and
ergonomic quality aspects determine a software’s appeal. In CHI ’00:
Proceedings of the SIGCHI Conference On Human Factors In Computing
Systems, pages 201–208, New York, NY, USA. ACM.

Hinz, A., Michalski, D., Schwarz, R., and Herzberg, P. Y. (2007). The acquiescence
effect in responding to a questionnaire. GMS Psycho-Social-Medicine, 4.

Hopson, J. (10 November 2006). We're Not Listening: An Open Letter to Academic
Game Researchers. In Gamasutra. Retrieved 21 August 2009 from
http://www.gamasutra.com/features/20061110/hopson_01.shtml

Hornbaek, K. (2006). Current practice in measuring usability: Challenges to usability


studies and research. International Journal of Human-Computer Studies,
64(2):79–102.

Huizinga, J. (1998) Homo Ludens. (R.F.C. Hull, Trans.) Routledge; New York
(Original work published 1938)

IJsselsteijn, W., de Kort, Y., Poels, K., Jurgelionis, A., and Bellotti, F. (2007).
Characterising and measuring user experiences in digital games. In
International Conference on Advances in Computer Entertainment.

IJsselsteijn, W. A., de Kort, Y. A. W., and Poels, K. (in preparation). The Game
Experience Questionnaire: Development of a Self-Report Measure to Assess
the Psychological Impact of Digital Games. Manuscript in preparation.

ISO 9241-11 (1998) Ergonomic Requirements for Office Work With Visual Display
Terminals (VDTs) – Part 11: Guidance on Usability

Jacoby, J. and Matell, M. S. (1971). Three-point Likert scales are good enough.
Journal of Marketing Research, 8(4):495–500.

Jennett, C., Cox, A., and Cairns, P. (2009). Being 'in the game'. In Günzel, S., Liebe,
M., and Mersch, D., editors, Proc. of the Philosophy of Computer Games
2008, pages 210–227. Potsdam University Press.

Jennett, C., Cox, A., Cairns, P., Dhoparee, S., Epps, A., Tijs, T., and Walton, A.
(2008). Measuring and defining the experience of immersion in games.
International Journal of Human-Computer Studies, 66(9):641–661.

Jensen, M. P. (2003). Questionnaire validation: a brief guide for readers of the


research literature. The Clinical Journal Of Pain, 19(6):345–352.

Kim, J. H., Gunn, D. V., Schuh, E., Phillips, B., Pagulayan, R. J., and Wixon, D.
(2008). Tracking real-time user experience (TRUE): a comprehensive
instrumentation solution for complex systems. In CHI ’08: Proceeding Of The
Twenty-Sixth Annual SIGCHI Conference On Human Factors In Computing
Systems, pages 443–452, New York, NY, USA. ACM.

Kim, W. W. (2006). Engagement, Body Movement and Emotions in Games:


Relationships And Measurements. Master’s thesis, University College London.

Kline, P. (1998). New Psychometrics: Science, Psychology and Measurement.


Routledge.

Korhonen, H. and Koivisto, E. M. I. (2006). Playability heuristics for mobile games.


In MobileHCI ’06: Proceedings Of The 8th Conference On Human-Computer
Interaction With Mobile Devices And Services, pages 9–16, New York, NY,
USA. ACM.

Korhonen, H. and Koivisto, E. M. I. (2007). Playability heuristics for mobile multi-


player games. In DIMEA ’07: Proceedings Of The 2nd International
Conference On Digital Interactive Media In Entertainment And Arts, pages
28–35, New York, NY, USA. ACM.

Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50(1):537–


567.

Laitinen, S. (23 June 2005). Better games through usability evaluation and testing. In
Gamasutra. Retrieved 10 June 2009 from
http://www.gamasutra.com/features/20050623/laitinen_01.shtml

Lazzaro, N. (2004). Why we play games: Four keys to more emotion in player
experiences. In Proceedings of the Game Developers Conference.

Lehmann, D. R. and Hulbert, J. (1972). Are three-point scales always good enough?
Journal of Marketing Research, 9(4):444–446.

Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: psychometric


evaluation and instructions for use. Int. J. Hum.-Comput. Interact., 7(1):57–
78.

Lietz, P. (2008) Questionnaire design in attitude and opinion research: Current state
of an art. Technical report number: FOR 655, Jacobs University Bremen.

Lindley, C. and Nacke, L. (2008). Boredom, immersion, flow - a pilot study


investigating player experience. In IADIS Gaming 2008: Design for Engaging
Experience and Social Interaction. IADIS.

Lindley, S. E., Le Couteur, J., and Berthouze, N. L. (2008). Stirring up experience
through movement in game play: effects on engagement and social behaviour.
In CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on
Human factors in computing systems, pages 511–514, New York, NY, USA.
ACM.

Malone, T. W. (1981). Toward a theory of intrinsically motivating instruction.


Cognitive Science, 5(4):333–369.

Martin, E. (2006). Survey questionnaire construction. Technical report, US Census


Bureau.

Massimini, F. and Carli, M. (1988). The systematic assessment of flow in daily


experience. In Csikszentmihalyi, M. and Csikszentmihalyi, I., editors, Optimal
Experience, pages 288–306. Cambridge University Press, New York.

Masters, J. R. (1974). The relationship between number of response categories and


reliability of Likert-type questionnaires. Journal of Educational Measurement,
11(1):49–53.

McGarvey, S. (17 June 2008). Reviews: The Incredible Hulk. In GameSpy. Retrieved
08 August 2009 from http://uk.ps3.gamespy.com/playstation-3/the-incredible-
hulk-the-movie/882516p1.html

McGuire, W. J. (1960). Cognitive consistency and attitude change. Journal of


Abnormal and Social Psychology, 60:345–353.

McMahan, A. (2003). Immersion, engagement, and presence: A method for analyzing


3d videogames. In Wolf, M. J. P. and Perron, B., editors, The Video Game
Theory Reader, pages 67–86. Routledge, New York.

Miller, G. A. (1956). The magical number seven plus or minus two: some limits on
our capacity for processing information. Psychol Rev, 63(2):81–97.

Nacke, L. and Lindley, C. A. (2008). Flow and immersion in first-person shooters:


measuring the player’s gameplay experience. In Future Play ’08: Proceedings
of the 2008 Conference on Future Play, pages 81–88, New York, NY, USA.
ACM.

Nielsen, J. and Molich, R. (1990). Heuristic evaluation of user interfaces. In CHI ’90:
Proceedings Of The SIGCHI Conference On Human Factors In Computing
Systems, pages 249–256, New York, NY, USA. ACM Press.

Oppenheim, A. N. (1992). Questionnaire Design, Interviewing & Attitude Measurement.


Pinter, London.

Pagulayan, R. J., Keeker, K., Wixon, D., Romero, R. L., and Fuller, T. (2003). User-
centered design in games. In J. A. Jacko and A. Sears (Eds.) The Human-
Computer Interaction Handbook: Fundamentals, Evolving Technologies and
Emerging Applications, pages 883–906. Lawrence Erlbaum Associates, New York.

Pallant, J. (2001). SPSS Survival Manual. Open University Press, Maidenhead, UK.

Peytchev, A., Couper, M. P., McCabe, E. S., and Crawford, S. D. (2006). Web survey
design. Public Opinion Quarterly, 70(4):596–607.

Pinchbeck, D. (2005). Is presence a relevant or useful construct in designing game


environments? In Proceedings of the 2nd Annual International Workshop in
Computer Game Design and Technology. Liverpool John Moores University.

Pinelle, D. and Wong, N. (2008). Heuristic evaluation for games: usability principles
for video game design. In CHI '08: Proceeding Of The Twenty-Sixth Annual
SIGCHI Conference On Human Factors In Computing Systems, pages 1453–
1462, New York, NY, USA. ACM.

Plunkett, L. (10 July 2009). Eidos once again attempting to mess with review scores?
In Kotaku. Retrieved 08 August 2009, from
http://kotaku.com/5311606/%5Bupdate%5D-eidos-once-again-attempting-to-
mess-with-review-scores.

Polson, P. G., Lewis, C., Rieman, J., and Wharton, C. (1992). Cognitive
walkthroughs: a method for theory-based evaluation of user interfaces. Int. J.
Man-Mach. Stud., 36(5):741–773.

Preston, C. and Colman, A. M. (2000). Optimal number of response categories in


rating scales: reliability, validity, discriminating power, and respondent
preferences. Acta Psychologica, 104(1):1-15.

Rigby, S. and Ryan, R. (2007). The Player Experience of Need Satisfaction (PENS):
An applied model and methodology for understanding key components of the
player experience. Technical report, Immersyve, Celebration, FL.

Ryan, R. M., Rigby, C. and Przybylski, A. (2006). The motivational pull of video
games: A self-determination theory approach. Motivation and Emotion,
30(4):344–360.

Sauro, J. and Lewis, J. R. (2009). Correlations among prototypical usability metrics:


evidence for the construct of usability. In CHI '09: Proceedings Of The 27th
International Conference On Human Factors In Computing Systems, pages
1609–1618, New York, NY, USA. ACM.

Schaeffer, N. C. and Presser, S. (2003). The science of asking questions. Annual


Review of Sociology, 29:65–88.

Scheibler, D. and Schneider, W. (1985). Monte Carlo tests of the accuracy of cluster
analysis algorithms: A comparison of hierarchical and nonhierarchical
methods. Multivariate Behavioral Research, 20(3):283–304.

Shelley, B. (15 August 2001). Guidelines for developing successful games.
Gamasutra. Retrieved 15 June 2009, from
http://www.gamasutra.com/features/20010815/shelley_01.htm

Slater, M. (2004). How colorful was your day? Why questionnaires cannot assess
presence in virtual environments. Presence: Teleoper. Virtual Environ.,
13(4):484–493.

Stevens, J. (2002). Applied Multivariate Statistics For The Social Sciences. Lawrence
Erlbaum Associates, Philadelphia.

Stuart, K. (17 January 2008). Interview: the science and art of metacritic. In The
Guardian Gamesblog. Retrieved 08 August 2009 from
http://www.guardian.co.uk/technology/gamesblog/2008/jan/17/interviewtheart
ofmetacriti

Sweetser, P. and Wyeth, P. (2005). GameFlow: a model for evaluating player


enjoyment in games. Comput. Entertain., 3(3):3.

Takatalo, J., Häkkinen, J., Komulainen, J., Särkelä, H., and Nyman, G. (2006).
Involvement and presence in digital gaming. In NordiCHI ’06: Proceedings of
the 4th Nordic conference on Human-computer interaction, pages 393–396,
New York, NY, USA. ACM Press.

Thorndike, R. M. (1978). Correlational Procedures for Research. Gardner Press,


New York.

Tourangeau, R. (1999). Context effects on answers to attitude questions. In Sirken, M.


G., Herrmann, D. J., Schechter, S., Schwarz, N., Tanur, J. M., and
Tourangeau, R., editors, Cognition and Survey Research, pages 111-132.
Wiley, New York.

Tractinsky, N., Katz, A., and Ikar, D. (2000). What is beautiful is usable. Interacting
with Computers, 13(2):127–145.

Tullis, T. S. and Stetson, J. N. (2004). A comparison of questionnaires for assessing


website usability. In Proceedings of the Usability Professionals Association
Conference, Minneapolis, MN: UPA.

Velez, P. and Ashworth, S. D. (2007). The impact of item readability on the


endorsement of the midpoint response in surveys. Survey Research Methods,
1(2):69–74.

Wickramasekera, I. E. (2007). Empathic features of absorption and incongruence. The


American Journal of Clinical Hypnosis, 50(1):59–69.

Witmer, B. G. and Singer, M. J. (1998). Measuring presence in virtual environments:


A presence questionnaire. Presence: Teleoper. Virtual Environ., 7(3):225–
240.

Zagalo, N., Torres, A., and Branco, V. (2005). Emotional spectrum developed by
virtual storytelling. In Virtual Storytelling, pages 105-114.

Appendix A

The Initial Gameplay Scale

1.) Information provided to participants before beginning questionnaire:

Participant Information - Read This

We would like to invite you to participate in this research project. Before


you decide whether you want to take part, it is important for you to read
the following information carefully and discuss it with others if you wish.

● The purpose of this study is to investigate player


experience of video games using a questionnaire. You will
have been asked to play a game for up to 2 hours.

● You are now asked to complete the following survey, answering


each of the statements as truthfully as possible. The survey should
take around 10 minutes to complete.

● This study is being performed by University College London (UCL)


with the cooperation of Sony Computer Entertainment Europe (SCEE);
please note that whilst SCEE will have access to the results of the
study, they will not have access to any private details for use in
marketing, etc. All data will be collected and stored in accordance
with the Data Protection Act 1998.

● If you decide to take part you are still free to withdraw at any
time and without giving a reason.

● Ask us if there is anything that is not clear or you would like more
information. If you do have any questions, please contact the
researcher for this study, Mark Parnell, at mjparnell@gmail.com or
the supervisor for the research, Dr. Nadia Bianchi-Berthouze, at
n.berthouze@ucl.ac.uk

HEALTH WARNING

Always play in a well lit environment. Take regular breaks, 15 minutes every
hour. Discontinue playing if you experience dizziness, nausea, fatigue or have
a headache. Some individuals are sensitive to flashing or flickering lights or
geometric shapes and patterns, may have an undetected epileptic condition
and may experience epileptic seizures when watching television or playing
videogames. Consult your doctor before playing videogames if you have an
epileptic condition and immediately should you experience any of the
following symptoms whilst playing: altered vision, muscle twitching, other
involuntary movement, loss of awareness, confusion and/or convulsions.

2.) Example of appearance of question with response item in both the initial and
revised Gameplay Scales:

Figure A.1 Example response item from the initial Gameplay Scale.

3.) Full list of questions in the initial Gameplay Scale, in correct order:

Q1 I enjoyed the game.


Q2 I felt the game was hard.
Q3 I was focused on the game.
Q4 I could identify with the characters.
Q5 I thought that the game was fun.
Q6 I found the game boring.
Q7 I liked how the game looked.
Q8 The game trained me in all of the controls.
Q9 I thought that the game was repetitive.
Q10 I found the game to be easy.
Q11 Playing the game made me happy.
Q12 I knew how to use the controller with the game.
Q13 I felt like events in the game were happening to me.
Q14 I found the game's menus to be usable.
Q15 I always knew where to go in the game.
Q16 I knew how to change the settings in the game.
Q17 It felt like I was responsible for what happened in the game.
Q18 Moving my point of view in the game was easy.
Q19 The game would provide help at appropriate moments.
Q20 I was unaware of the passage of time whilst playing.
Q21 I found the appearance of the game world to be interesting.
Q22 I found the controls to be difficult.
Q23 I forgot about my surroundings whilst playing.
Q24 I found using the options screen to be difficult.
Q25 I thought the controls were intuitive.
Q26 I concentrated on sounds in the game.
Q27 I knew how the game would respond to my actions.
Q28 I always knew how to achieve my aim in the game.
Q29 I thought the game mechanics were consistent.
Q30 My objectives in the game were unclear.
Q31 The game provided me with an adequate tutorial.
Q32 I lost my direction through the game.
Q33 I knew when my goal in the game had changed.

Q34 I thought that the game's menus were intuitive.
Q35 I felt that the game provided enough variety.
Q36 I found the game's menus to be cumbersome.
Q37 I couldn't find my way in the game world.
Q38 I found the game mechanics to be varied enough.
Q39 I found the game's story to be dull.
Q40 I played by my own rules in the game.
Q41 I thought about things other than the game whilst playing.
Q42 My field of view made it difficult to see what was happening in the game.
Q43 The aesthetics of the game were unimpressive.
Q44 I thought the camera angles in the game were appropriate.
Q45 The game failed to motivate me to keep playing.
Q46 The game responded to my inputs in an inconsistent way.
Q47 I wanted to explore the game world.
Q48 I thought the level of difficulty was right for me.
Q49 I knew how to customize the way that the game was set up.

Demographic Questions (asked at the very end):

- How old are you?

- What is your gender?

- Is English your first language?

- Thank you for completing the questionnaire. If you have any further
comments about the content of the questionnaire - or what it lacked -
please provide them below.

Appendix B

Results of Cluster Analysis on the Initial Gameplay Questionnaire

Table B.1. Cluster Membership of Each Gameplay Scale Item.


Construct   Question   Question Number   Cluster
Affect I enjoyed the game. Q1 AE
Affect Playing the game made me happy. Q11 AE
Sensory I found the appearance of the game world to be interesting. Q21 AE
Controls I thought the controls were intuitive. Q25 AE
Sensory I concentrated on sounds in the game. Q26 AE
Menus I thought that the game's menus were intuitive. Q34 AE
Variety I felt that the game provided enough variety. Q35 AE
Fictional I found the game's story to be dull. Q39 AE
Sensory The aesthetics of the game were unimpressive. Q43 AE
Ownership The game failed to motivate me to keep playing. Q45 AE
Fictional I wanted to explore the game world. Q47 AE
Affect I thought that the game was fun. Q5 AE
Affect I found the game boring. Q6 AE
Sensory I liked how the game looked. Q7 AE
Challenge I found the game to be easy. Q10 F
Fictional I felt like events in the game were happening to me. Q13 F
Ownership It felt like I was responsible for what happened in the game. Q17 F
Camera Moving my point of view in the game was easy. Q18 F
Absorption I was unaware of the passage of time whilst playing. Q20 F
Absorption I forgot about my surroundings whilst playing. Q23 F
Absorption I was focused on the game. Q3 F
Variety I found the game mechanics to be varied enough. Q38 F
Fictional I could identify with the characters. Q4 F
Ownership I played by my own rules in the game. Q40 F
Absorption I thought about things other than the game whilst playing. Q41 F
Camera My field of view made it difficult to see what was happening in the game. Q42 F
Camera I thought the camera angles in the game were appropriate. Q44 F
Challenge I thought the level of difficulty was right for me. Q48 F
Navigation I always knew where to go in the game. Q15 PB
Consistency I knew how the game would respond to my actions. Q27 PB
Goals I always knew how to achieve my aim in the game. Q28 PB
Consistency I thought the game mechanics were consistent. Q29 PB
Goals My objectives in the game were unclear. Q30 PB
Navigation I lost my direction through the game. Q32 PB
Goals I knew when my goal in the game had changed. Q33 PB
Navigation I couldn't find my way in the game world. Q37 PB
Controls I knew how to use the controller with the game. Q12 UB
Menus I found the game's menus to be usable. Q14 UB
Settings I knew how to change the settings in the game. Q16 UB
Help The game would provide help at appropriate moments. Q19 UB
Challenge I felt the game was hard. Q2 UB
Controls I found the controls to be difficult. Q22 UB

Settings I found using the options screen to be difficult. Q24 UB
Help The game provided me with an adequate tutorial. Q31 UB
Menus I found the game's menus to be cumbersome. Q36 UB
Consistency The game responded to my inputs in an inconsistent way. Q46 UB
Settings I knew how to customize the way that the game was set up. Q49 UB
Help The game trained me in all of the controls. Q8 UB
Variety I thought that the game was repetitive. Q9 UB
(“AE” = Affective Experience; “F” = Focus; “PB” = Playability Barriers; “UB” = Usability
Barriers)
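
For readers who wish to reproduce this style of grouping on their own response data, the following sketch (in Python) illustrates one possible approach: hierarchically clustering items on the basis of their inter-item correlations. It is a sketch only; the file name, linkage method and cluster count are assumptions for demonstration and are not a record of the analysis reported in this thesis.

# Illustrative sketch: hierarchical clustering of questionnaire items from
# inter-item correlations. The file name, linkage method and cluster count
# are assumptions, not the settings used in this study.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

responses = pd.read_csv("gameplay_responses.csv")  # rows = participants, columns = Q1..Q49

# Convert the item correlation matrix into a distance matrix (1 - r),
# then into the condensed form expected by scipy's linkage function.
corr = responses.corr()
condensed = squareform((1 - corr).values, checks=False)

# Agglomerative clustering with average linkage, cut into four clusters.
tree = linkage(condensed, method="average")
membership = fcluster(tree, t=4, criterion="maxclust")

for item, cluster in sorted(zip(corr.columns, membership), key=lambda pair: pair[1]):
    print(item, "-> cluster", cluster)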

Appendix C

The Revised Gameplay Scale

Full List of Questions in the Revised Gameplay Scale, in Correct Order

Q1 I enjoyed the game.


Q2 I was focused on the game.
Q3 I could identify with the characters.
Q4 I thought that the game was fun.
Q5 The game trained me in all of the controls.
Q6 I thought the level of difficulty was right for me.
Q7 I found the game's menus to be usable.
Q8 I knew how to use the controller with the game.
Q9 I was unaware of the passage of time whilst playing.
Q10 I found the appearance of the game world to be interesting.
Q11 I knew how to change the settings in the game.
Q12 My objectives in the game were unclear.
Q13 I thought about things other than the game whilst playing.
Q14 I knew how the game would respond to my actions.
Q15 I couldn't find my way in the game world.
Q16 I always knew how to achieve my aim in the game.
Q17 I found the game's menus to be cumbersome.
Q18 I found the game mechanics to be varied enough.
Q19 I forgot about my surroundings whilst playing.
Q20 My field of view made it difficult to see what was happening in the game.
Q21 I found using the options screen to be difficult.
Q22 The aesthetics of the game were unimpressive.
Q23 I thought the camera angles in the game were appropriate.
Q24 The game failed to motivate me to keep playing.
Q25 I always knew where to go in the game.
Q26 I wanted to explore the game world.

Demographic Questions (asked at the very end):

- How old are you?

- What is your gender?

- Is English your first language?

- What is your favourite video game genre?
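
As a rough illustration of how responses to the revised scale might be converted into subscale scores, the sketch below reverse-codes negatively worded items and averages the items within each subscale. The item-to-subscale mapping is inferred by matching each revised item to its original wording and cluster in Appendix B; the 1-7 response range, the choice of reverse-coded items and the scoring direction are assumptions rather than the study's documented procedure.

# Illustrative scoring sketch for the revised Gameplay Scale (Appendix C).
# Subscale membership is inferred from Appendix B; the response range and the
# reverse-coded item set are assumptions, not documented procedure.
import pandas as pd

SCALE_MIN, SCALE_MAX = 1, 7   # assumed Likert response range

SUBSCALES = {
    "affective_experience": ["Q1", "Q4", "Q10", "Q22", "Q24", "Q26"],
    "focus": ["Q2", "Q3", "Q6", "Q9", "Q13", "Q18", "Q19", "Q20", "Q23"],
    "playability_barriers": ["Q12", "Q14", "Q15", "Q16", "Q25"],
    "usability_barriers": ["Q5", "Q7", "Q8", "Q11", "Q17", "Q21"],
}

# Negatively worded items, reverse-coded so that higher always means "better".
REVERSED = {"Q12", "Q13", "Q15", "Q17", "Q20", "Q21", "Q22", "Q24"}

def score(responses: pd.DataFrame) -> pd.DataFrame:
    """One row per participant; columns are the four subscale means."""
    recoded = responses.copy()
    for item in REVERSED:
        recoded[item] = SCALE_MIN + SCALE_MAX - recoded[item]
    return pd.DataFrame(
        {name: recoded[items].mean(axis=1) for name, items in SUBSCALES.items()}
    )

# Example (hypothetical file): score(pd.read_csv("revised_scale_responses.csv"))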

Appendix D

The Appeal Scale

1.) Example of appearance of response items in the Appeal Scale:

Figure D.1 Example response item from the Appeal Scale.

2.) Full list of word-pairs in the Appeal Scale, in correct order:

1. unpleasant - pleasant

2. bad - good

3. unaesthetic - aesthetic

4. rejecting - inviting

5. unattractive - attractive

6. discouraging - motivating

7. undesirable - desirable

8. boring - fun

(Appeal Scale always administered after the Gameplay Scale)
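
If a single appeal score is required (for example, as the outcome variable in the regression sketch given earlier), one simple approach is to average the eight word-pair ratings per participant, as in the sketch below. The 1-7 response range, the column names and the use of a simple mean are assumptions for demonstration purposes.

# Illustrative sketch: a single appeal score as the mean of the eight
# semantic-differential word-pairs. Column names and the 1-7 range are assumed.
import pandas as pd

APPEAL_ITEMS = [
    "unpleasant_pleasant", "bad_good", "unaesthetic_aesthetic",
    "rejecting_inviting", "unattractive_attractive",
    "discouraging_motivating", "undesirable_desirable", "boring_fun",
]

def appeal_score(responses: pd.DataFrame) -> pd.Series:
    """Mean of the eight word-pair ratings, one value per participant."""
    return responses[APPEAL_ITEMS].mean(axis=1)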

Appendix E

Information Sheet and Consent Form used in Study #2

Information Sheet for Participants in Research Studies


You will be given a copy of this information sheet.

Title of Project: Playing with Scales: Creating a Measurement Scale to Assess Player Experience When Testing Video Games

This study has been approved by the UCL Research Ethics


Committee [Project ID Number]: XXXXX

Name, Address and Contact Details of Investigators: Nadia Bianchi-Berthouze (Lecturer)


UCL Interaction Centre (UCLIC)
University College London
Malet Place Engineering Building, 8th floor
Gower Street
London WC1E 6BT, UK
Email: n.berthouze@ucl.ac.uk
Tel: +44 (0)20 7679 0690 (internal: X30690)

Mark Parnell (MSc Student)


Oakdene
Farthings Hill
Horsham
West Sussex
RH12 1TS
Email: mjparnell@gmail.com
Tel: 07792812843

We would like to invite you to participate in this research project. You should only participate if you want
to; choosing not to take part will not disadvantage you in any way. Before you decide whether you want to
take part, it is important for you to read the following information carefully and discuss it with others if you
wish. Ask us if there is anything that is not clear or you would like more information.

The purpose of this study is to investigate player experience of videogames, using both a questionnaire and
an interview. You will be asked to play a game for one hour whilst being videotaped playing. You will then
be asked to complete a questionnaire, after which you will be given a short interview by the researcher,
during which the video will be viewed and discussed. The experiment will then end. This study is being
performed in conjunction with Sony Computer Entertainment Europe; please note that whilst they will
have access to the results of the study, they will not have access to any private details for use in
marketing, etc.

It is up to you to decide whether or not to take part. If you choose not to participate it will involve no
penalty or loss of benefits to which you are otherwise entitled. If you decide to take part you will be given
this information sheet to keep and be asked to sign a consent form. If you decide to take part you are still
free to withdraw at any time and without giving a reason.

All data will be collected and stored in accordance with the Data Protection Act 1998.

Informed Consent Form for Participants in Research Studies

Title of Project: Playing with Scales: Creating a Measurement Scale to Assess Player Experience When Testing Video Games

This study has been approved by the UCL Research Ethics Committee
[Project ID Number]: XXXXX

Participant’s Statement

I …………………………………………......................................
agree that I have

● read the information sheet and/or the project has been explained to me orally;
● had the opportunity to ask questions and discuss the study;
● received satisfactory answers to all my questions or have been advised of an individual to contact for answers to pertinent questions about the research and my rights as a participant and whom to contact in the event of a research-related injury;
● understood that my participation will be taped/video recorded and I am aware of, and consent to, any use you intend to make of the recordings after the end of the project;
● read and understood the involvement of Sony Computer Entertainment Europe in this study.

I understand that I am free to withdraw from the study without penalty if I so wish and I consent to
the processing of my personal information for the purposes of this study only and that it will not be
used for any other purpose. I understand that such information will be treated as strictly
confidential and handled in accordance with the provisions of the Data Protection Act 1998.

Signed: Date:

I …………………………………………......................................
agree to the publishing of frames from my videos (in which my face will be blanked out) in academic
publications

Yes No

Investigator’s Statement

I ……………………………………………………………………..

confirm that I have carefully explained the purpose of the study to the participant and outlined any
reasonably foreseeable risks or benefits (where applicable).

Signed: Date:
