
1 The Problem with Usability Testing

Usability testing is the method most commonly applied by User Experience (UX) professionals to evaluate the usability and overall experience of a given interactive system [7]. Indeed, textbooks designed for introductory courses in UX, Human-Computer Interaction (HCI), Human-Centred Design (HCD), or similar almost certainly have a section or an entire chapter devoted to usability testing [e.g., 1, 3, 10, 17]. Despite such a focus on usability testing amongst UX professionals, several authors have noted variances amongst UX professionals concerning the outcomes of tests run on the same interface [e.g., 11, 12, 13, 14]. Lindgaard and Chattratichart [11] found that the number of tasks provided in testing was positively correlated with the percentage of problems found and with the percentage of new problems found. These findings suggest that some of the variance in UX professionals’ usability test results can be explained by how many tasks they cover in their testing. Nonetheless, there is one undeniable truth that can be stated given the results of the CUE studies conducted by Molich and colleagues [e.g., 12, 13, 14]. As Molich et al. [14] stated, the “simple assumption that we are all doing the same and getting the same results in a usability test is plain wrong.” In the present paper, we present a thesis offering alternative or additional reasons for the variance found amongst UX professionals’ usability testing results.

Addressing the variance in usability testing results amongst UX professionals is important because it arguably affects the perception of those who use UX professional services. Simply put, if stakeholders cannot receive reliable services and findings, how are they to distinguish a serious issue from a spurious result? Molich et al.’s CUE studies demonstrating this variance have also gained substantial attention within the UX field, as evidenced by the more than 500 citations these papers have accrued [i.e., 12, 13, 14]. To address the issue of usability testing variance, we introduce five potential problems with usability testing, supported by academic evidence as well as some of our own experiences in industry. Recommendations are offered to address each problem. Nonetheless, each problem is introduced in the hope that future researchers will pick up the baton and further investigate the legitimacy and potential impact of each problem.

To provide additional context, a brief history of usability testing is reviewed before proceeding to the identified problems. Finally, in terms of scope, the focus is restricted to lab-based usability testing, not remote variations of usability testing. This focus is in line with Molich et al.’s papers [e.g., 12, 13, 14], which have served as the impetus for the present paper.

1.1 A Brief History of Usability Testing

It is interesting to note that there are few academic books and papers written on the history of Human-Computer Interaction (HCI), let alone the history of usability testing specifically [e.g., 15, 17, 19]. Most accounts of HCI in the academic literature generally trace the origins of the field to Xerox PARC in the late 1970s and early 1980s [e.g., 15, 17]. Still, some online sources place the beginnings of HCI in World War I and World War II [e.g., 19, 20]. Whitby [20] specifically discusses the role of evaluation in cockpit design in World War II, when human factors and psychology made their first attempts at scientifically evaluating and improving human performance and the interaction design of a given system (i.e., the cockpit). It is therefore argued that usability testing began during WWII, in the 1940s. For a more complete historical review of HCI, Norman [17] is recommended for an in-depth account.

Early usability testing as described by Norman [17] and Whitby [20] suggests that most testing from WWII and the PARC era was validation testing in nature. Validation testing describes a specific type of usability testing in which a given system or interface is tested with specific tasks that are measured against predetermined usability standards or benchmarks [18]. Since then, however, Rubin and Chisnell [18] have suggested that the most common of the four types of usability testing they identify is assessment testing. Assessment testing is described as a cross between more exploratory research and a more objective, measured test similar to validation testing. Participants perform tasks, as with validation testing, but performance is not assessed against predetermined benchmarks. Rubin and Chisnell suggest that quantitative measures are taken, but there is dialogue between participant and facilitator to explore any issues the participant encounters. In sum, assessment testing lies along a continuum describing how usability testing is carried out, between validation testing at one extreme and exploratory testing at the other, where no tasks are defined and participants are asked to evaluate the design concept in an exploratory manner. Of note, the fourth and final type of usability testing Rubin and Chisnell identify is comparative usability testing, where researchers compare two systems or interfaces against each other [18]. Comparative usability testing can be carried out as part of any of the other three test types mentioned above.

Rubin and Chisnell [18] provide a good overview of the different types of usability testing (see Fig. 1). The popular assessment testing is problematic, however, precisely because it lies on a continuum between two well-defined extreme forms of testing. Most UX professionals are familiar with usability testing as a research method via introductory texts [e.g., 1, 3, 18]. The problem, as evidenced by Molich’s research, lies in the application of usability testing [e.g., 12, 13, 14]. It is argued that relatively few UX professionals practicing usability testing have been formally trained in, or have familiarized themselves with, the few texts that deal specifically with when to conduct a usability test and how to facilitate it [e.g., 6, 18]. Therefore, the identified problems are issues that arguably occur as a result of a popular research method being applied in UX by professionals who may have a limited understanding of when and how to apply it.

Fig. 1. Rubin and Chisnell’s [18] four identified types of usability testing.

2 Identified Problems

2.1 Usability Testing as the Wrong Method

Usability testing is recognized as a keystone and one of the principal methods for evaluating the design and usability of a system during development, before release [e.g., 1, 7, 18]. Arguably, it is applied so predominantly and broadly that it is used in various situations or contexts where a different approach would be more appropriate. In other words, this go-to method for UX professionals is applied too often and in the wrong situations. Too many UX professionals simply use the method as a matter of course, without thinking about: 1) where they are in the design process, 2) the design problem that is being addressed, 3) what information the team requires to move the project forward, and 4) the research challenge they are trying to solve.

As early as 2008, Saul Greenberg and Bill Buxton also questioned the broad application of usability testing in their article titled Usability Evaluation Considered Harmful (Some of the Time) [7]. In their influential paper, Greenberg and Buxton argue that while usability testing may be valuable in many contexts, its heavy promotion as a central method of evaluation in all situations limits our ability to pursue more explorative techniques that enrich our understanding of the problems we are trying to solve, thereby enabling more creative solutions to those problems. Fundamentally, they argue that it is the responsibility of the researcher to consider thoughtfully the research questions and situation they are trying to address, and from that to select appropriate method(s) that will effectively address those questions and that situation. While CHI has embraced usability evaluation because of its utility in most research contexts, it is not always the right approach.

Eight years later, Hertzum [9] further questioned how usability testing is applied during the development and testing of a product or service. While Hertzum does not question the methodology itself, he does question when and how it is best applied during the development process.

That so few articles appear to directly question the broad use of usability testing as a method is testament to the industry’s blind faith in its use. Whilst there are certainly many occasions where usability testing is used effectively, these articles indicate there are too many instances where it is applied incorrectly. Figure 2 presents the double diamond approach as described by Austin [2]. Given the double diamond framework, it is argued that usability testing is most effectively used as an assessment research approach as defined by Rubin and Chisnell [18]. Thus, usability testing is limited in providing effective and insightful results in most instances where design is at a more exploratory phase of the double diamond process.

Fig. 2. Double diamond framework as presented by Jane Austin [2].

Recommendation: Choosing the Right Method.

One way to help UX professionals choose a more appropriate research method would be to supply them with a framework that supports the choice of method given the stage of the project. One such framework from industry, which integrates well with the double diamond design process, is Emma Boulton’s Research Funnel [4]. Boulton’s Research Funnel separates research into four distinct phases directly linked to the stage of the design project (see Fig. 3). Further fleshing this model out to define more strictly when to conduct usability testing, and disseminating that information, may prove helpful to the UX industry.

Fig. 3. Boulton’s research funnel framework [4].

2.2 Hypothesis Testing Versus Objectives Setting

Usability testing in the lab is definitely a qualitative research method, especially considering the literature often recommends small sample sizes, ranging from about four participants tested iteratively [4] to seven or eight [5]. Despite lab usability testing being a qualitative technique, the authors of this paper have often witnessed UX professionals developing “hypotheses” prior to conducting their research.

The problem with creating hypotheses for a qualitative method like lab usability testing is that it is simply wrong. Hypotheses are generated for the purposes of hypothesis testing, which is inherently a quantitative technique for making statistical inferences about a population based on the sample that has been tested [7]. Hypotheses are often generated from a sound literature review or, indeed, by conducting qualitative research to gain a better understanding of the subject of study. In other words, qualitative research methods are intended to be hypothesis-generating; they are not intended to determine the validity of a given hypothesis.

So why would UX professionals feel they need to create hypotheses for usability testing? Perhaps one reason the practice persists is to give stakeholders the appearance that usability testing is objective and robust research. The authors’ experience suggests that UX researchers are often stating objectives for the usability testing but labelling these as hypotheses and, sometimes, using wording that mirrors what would formally be considered an alternative hypothesis. For example, when testing navigation of a shopping journey, a hypothesis might read, “Most users are unable to complete a shopping journey without assistance.” Such a statement is problematic because it has the potential to bias the UX professional one way or another before any research has taken place. Another example, when testing the proposition of a new product, might be, “Users will like the proposition because of its properties X, Y, and Z.” Hypotheses like the latter are even more dangerous because, in addition to potentially biasing the research, they risk the UX professional making baseless claims about the potential success of a product.

Recommendation: Setting Objectives.

As Rubin and Chisnell [18] suggest, it is perfectly reasonable to acquire quantitative measures in usability testing. Often these measures take the form of time to complete a task, number of errors, number of prompts, or survey ratings reflecting participants’ attitudes or opinions about the system they have just tested. UX practitioners must nonetheless be careful when collecting these quantitative measures in typical usability tests consisting of approximately five participants. Five participants are nowhere near enough to generalize the results of these measures to the entire population of users. Instead, it must be made clear that, in most cases, UX professionals need to decide on predetermined criteria that these measures must meet to satisfy a defined usability standard. Any other use of quantitative measurement with such small sample sizes is arguably pointless.
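
As a minimal illustration of what such predetermined criteria might look like in practice, the following sketch (with hypothetical thresholds and observations, not drawn from any particular study) checks each participant’s measures against a usability standard agreed before testing; it deliberately makes no inference about the wider population of users.

    # Minimal sketch: checking observed measures against predetermined criteria.
    # All thresholds and observations below are hypothetical and illustrative only.

    # Criteria agreed with stakeholders *before* testing.
    CRITERIA = {
        "max_task_time_seconds": 120,  # task should be completable within two minutes
        "max_errors": 1,               # at most one error per participant
        "max_prompts": 0,              # no facilitator prompts required
    }

    # Observed measures for one task, one entry per participant.
    observations = [
        {"participant": "P1", "task_time_seconds": 95,  "errors": 0, "prompts": 0},
        {"participant": "P2", "task_time_seconds": 140, "errors": 2, "prompts": 1},
        {"participant": "P3", "task_time_seconds": 110, "errors": 1, "prompts": 0},
        {"participant": "P4", "task_time_seconds": 88,  "errors": 0, "prompts": 0},
        {"participant": "P5", "task_time_seconds": 132, "errors": 1, "prompts": 0},
    ]

    def meets_standard(obs):
        """Return True if a participant's measures satisfy the predetermined standard."""
        return (
            obs["task_time_seconds"] <= CRITERIA["max_task_time_seconds"]
            and obs["errors"] <= CRITERIA["max_errors"]
            and obs["prompts"] <= CRITERIA["max_prompts"]
        )

    for obs in observations:
        status = "meets standard" if meets_standard(obs) else "does not meet standard"
        print(obs["participant"] + ": " + status)

The point of such a check is purely descriptive: it records whether each observed session met the agreed standard, rather than projecting a success rate onto users in general.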

2.3 Construction of Protocol

As stated in the introduction, texts intended to provide budding UX professionals with an introduction to the field often include a section or chapter on usability testing [e.g., 1, 3, 10, 17]. Unfortunately, these texts provide only an overview of the logistics of setting up this kind of research (e.g., participant recruitment, equipment and setup, defining objectives, setting tasks). None of these texts goes into sufficient detail concerning how to construct tasks and questions for these sessions, and none provides any reference to books such as Rubin and Chisnell [18] or Dumas and Loring [6], which cover this issue in sufficient detail. Rubin and Chisnell provide a good overview and classification of the types of usability tests, and adequate detail on how one might go about preparing the protocol. Dumas and Loring, however, go into detail on preparation as well as on setting up tasks and questions.

When the referenced texts meant to train UX professionals on research methods are compared with market research texts meant to train market researchers on proper protocol [e.g., 5, 8], the UX texts are severely lacking in the detail required to train professionals properly on UX research protocol. Indeed, only Dumas and Loring [6] adequately address how to avoid leading questions, and even they do not address double-barreled or hypothetical questions, which are covered in market research texts [e.g., 5, 8]. The absence of proper protocol guidance in texts dealing with usability testing means that many UX professionals are simply unaware of these issues in questionnaire or interview protocol construction, or it is left to chance whether they are trained by a UX professional who does understand proper interview protocol construction. In our combined 25 years’ experience, we have often encountered scripts from UX professionals containing questions that are leading, double-barreled, or hypothetical (e.g., a leading question such as “How easy was that task to complete?” rather than the neutral “How did you find that task?”). All of these types of questions are known to undermine the integrity of the research, and there would almost certainly be variance between the results of professionals who were unaware of these pitfalls and those of professionals who understood how to avoid such questions and construct robust research protocol.

Recommendation: Awareness is Key.

Fault for this ignorance of effective and robust protocol construction amongst some UX professionals lies with the UX professional bodies. By comparison, market research texts and, presumably, courses do a better job of educating market researchers in detail on how to carry out various research methods [e.g., 5, 8]. The UX profession needs to do a better job, in its texts and arguably in coursework in at least some cases, of providing detailed instruction on how to construct good research protocol and which types of questions to avoid.

2.4 Facilitation of Usability Testing

Facilitation of usability testing often suffers from problems similar to those identified in the previous section concerning the construction of testing protocol. In fact, in our experience, if the test script contains problems with the way questions are constructed, there are often issues with facilitation as well. As in the previous section, only one book in the UX literature properly covers how to facilitate a usability testing session and what issues to avoid [i.e., 6]. In their book, Dumas and Loring [6] tackle subjects nearly always overlooked in other books on UX. These subjects include giving participants enough time to speak and to answer questions sufficiently, when to interrupt, how and when to prompt during tasks, and how to handle the delicate situations that inevitably occur in some sessions. Finally, Dumas and Loring state that there is no perfect usability test, but it is important that facilitators reflect on their last session and understand what went well and what could be improved.

The results of any given round of usability testing are influenced most by how the testing is facilitated by the UX professional. In our experience, strong facilitators enable participants to tell a story that expresses where the issues lie in the system or interface being tested and, most importantly, why these issues are important for the participant to have rectified. It is the stories, and the reasons why that support those stories, that provide powerful results in qualitative research. Strong facilitators grasp the importance of storytelling and of understanding the underlying reasons why participants say or do things during testing. They pull this information out of participants, seemingly effortlessly at times. Weaker facilitators who have not been properly trained tend to stay at a surface level, meaning that they do not probe to understand why participants behaved a certain way or provided a given response. These facilitators are prone to identifying an issue without fully understanding why it is problematic for the user. Furthermore, weaker facilitators are more prone to gloss over issues and miss identifying them. Ultimately, it is argued that results from weaker facilitators, lacking that deeper understanding of issues, make it more difficult for design teams to address a given issue effectively. Thus, it is argued that this usability testing problem may explain some of the variation found in Molich et al.’s CUE papers [e.g., 12, 13, 14]. It is important that discrepancies in facilitator quality are minimized so that more consistent results from usability testing can be obtained regardless of who the facilitator is.

Recommendation: Proper Facilitation Mentoring.

Effective facilitation of usability testing can be tougher than it looks (see Fig. 4). As with the previous usability testing problem, it is the responsibility of experienced UX professionals and their professional associations to set up mentoring programs. A certification for experienced UX professionals could mirror that of the Market Research Society (MRS). It would mean proper training in the facilitation of usability testing, among other UX research methods, and novice UX professionals could see what “good” looks like by observing experienced, certified professionals. Experienced UX professionals would then mentor novice UX professionals until a level of competency in facilitation skills is achieved.

Fig. 4. Facilitation in lab usability testing can be tougher than it looks.

2.5 Reporting

Several UX texts do address reporting the results of a usability test [e.g., 1, 10, 18]. Nonetheless, reporting is the culmination of everything the facilitator has done up to that point; if there were problems with the usability testing prior to reporting, then the reporting will be influenced accordingly. For instance, we have seen several instances where UX professionals report on the likelihood that a concept might be a success, based on the responses gathered during testing. This is problematic because, as stated earlier, it is statistically not possible to suggest that a concept may or may not be a success based on the opinions of five or six participants in a round of usability testing. While that example might seem blatantly obvious, there are other, more common reporting practices that subtly communicate to stakeholders that an inference is being made from a sample of about five participants to the entire population of users. These include using percentages to express success or failure or some other type of behavior (e.g., where participants clicked) during testing, or reporting the number of participants who were successful or unsuccessful in a task out of the total number of participants (e.g., 2 out of 5 failed) when no predetermined standards were agreed prior to testing. The latter may seem particularly contentious, but we would argue that if no predetermined standards were set, then the test is more exploratory in nature and, therefore, why these participants failed should take precedence over how many did.
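
To illustrate why such figures overstate the precision a five-participant sample can support, consider a back-of-the-envelope calculation (the numbers are illustrative) using the adjusted Wald interval, a common choice for completion or failure rates from small samples. With x = 2 failures out of n = 5 participants and z = 1.96 for a 95% interval:

\[
\tilde{p} = \frac{x + z^2/2}{n + z^2} = \frac{2 + 1.92}{5 + 3.84} \approx 0.44,
\qquad
\tilde{p} \pm z \sqrt{\frac{\tilde{p}\,(1 - \tilde{p})}{n + z^2}} \approx 0.44 \pm 0.33 .
\]

That is, a reported failure rate of “2 out of 5” (40%) is consistent with a true failure rate anywhere from roughly 12% to 77%, which is one reason such numbers, presented without predetermined standards, can mislead stakeholders.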

How reports are written and delivered can be especially problematic. If reporting as described above does explain some of the variance witnessed in Molich et al.’s research [e.g., 12, 13, 14], then it presents a fundamental issue for the UX profession. Specifically, poor reporting can set improper expectations amongst naïve stakeholders concerning what can, and what cannot, be determined from a usability test. For example, imagine a senior executive stakeholder of a large corporation has invested in usability testing of a new concept the corporation will launch in the coming months. The usability testing report indicates that if certain changes are made, then the concept should be a success. An experienced UX professional would know that such a statement is patently wrong: usability testing can de-risk a new concept, but it cannot determine outright whether or not a concept will be a success. Still, if this is the senior executive’s first experience with usability testing, then s/he knows no better, and in its simplest form, one of two things is likely to happen: 1) the concept fails and the senior executive distrusts the practice of usability testing as a whole, or 2) the concept is a success and the senior executive has an expectation set that usability testing can effectively predict the success of a concept. Either way, the experience sets a bad precedent, and one that an experienced UX professional must unnecessarily overcome to promote the value of UX research.

Recommendation: Texts and Templates.

It may be helpful for new UX professionals to read in texts what is and is not acceptable to report in usability testing reports, and why these rules are in place (e.g., setting expectations as described above). Templates may also be included in texts, and experienced UX professionals may disseminate such templates to less experienced professionals so that they understand what a “good” report looks like.

3 Conclusion

There may be further problems with usability testing that have not been identified in this paper. Nonetheless, given the literature and our professional experience, we feel that the five problems identified are the most notable. Therefore, it is argued that the five problems identified in this paper contribute to the variation in perceived quality of UX research in the industry, and must be addressed to help prevent inaccurate stakeholder expectations and negative perceptions of UX from other disciplines.

This is not a full treatise on usability testing and its problems. Nonetheless, each problem is introduced in the hope that future researchers will pick up the baton and further investigate its legitimacy and potential impact. Therefore, we encourage others to test the validity of the assumptions and arguments made in this paper. As researchers, we would be very interested to understand better what drives variation in usability test findings. With that better understanding, it is hoped that the UX community could work toward solutions that would improve the consistency, validity, and efficacy of UX research as a whole.