1 Introduction
Mobile devices are the most popular computing devices [
62], and mobile applications are an integral part of people’s daily lives. Modern mobile devices are equipped with touchscreens, providing rich experiences for users; at the same time, they require developers to test and validate the functionality of their apps either manually or using automated tools. In the testing process, developers may neglect to evaluate their software for the approximately 15% of the world’s population with disabilities [
66], many of whom cannot use conventional interaction methods, such as touch gestures. According to legal requirements and social expectations, developers should design apps that are accessible to all users, regardless of their abilities. Still, prior studies have revealed that many popular apps ship with accessibility issues, preventing disabled users from using them effectively [
2,
23,
56].
App developers are aided by accessibility guidelines published by companies such as Apple [
16] and Google [
8], as well as technology institutes such as the World Wide Web Consortium [
65]. To understand how people with disabilities use mobile apps, developers are encouraged to conduct user studies with participants (preferably people with disabilities) who use assistive services, such as screen readers. Although software practitioners acknowledge the importance of human evaluation in accessibility testing, they admit that end-user feedback is difficult to obtain [
19]. Furthermore, for small development teams with limited resources, finding users with various types of disabilities and conducting such evaluations can be prohibitively challenging and expensive.
Using accessibility analysis tools, app compliance with guidelines and accessibility issues can be detected automatically [
5,
7,
17,
18]. An app’s User Interface (UI) can be analyzed, for example, to determine whether the contrast between elements and their backgrounds is above a certain threshold or whether a button’s touch area exceeds the minimum size defined in the guidelines. However, solely analyzing the UI specification of an app may not reveal many accessibility problems that only surface when assistive services, such as screen readers, are used. Blind users, for instance, rely on a screen reader like Android’s TalkBack to navigate UI elements and perform actions. If TalkBack is unable to focus on an element, the element becomes completely inaccessible to them.
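To make this kind of static check concrete, the following Kotlin sketch tests whether a view meets a minimum touch-target size; the function name and the 48dp default are illustrative assumptions on our part, not the implementation of any particular tool.

import android.view.View

// Hypothetical static check in the spirit of guideline-based analyzers:
// does the view's on-screen size meet a minimum touch-target threshold?
fun meetsMinTouchTarget(view: View, minDp: Int = 48): Boolean {
    val density = view.resources.displayMetrics.density
    val minPx = (minDp * density).toInt()
    return view.width >= minPx && view.height >= minPx
}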
Generally, automated accessibility testing does not consider assistive services, except for a few recent research tools [
1,
58,
60]. Latte [
58] assumes the availability of GUI test cases for validating an app’s functionality. The test cases are then repurposed to execute with assistive services, e.g., TalkBack, for accessibility analysis. Since developers rarely write GUI tests for their apps, Latte is limited to situations where GUI tests are available. According to a recent study, over 92% of Android app developers do not have GUI tests [
46]. Other works try to mitigate this issue by analyzing a single app screen while ignoring the app’s functionalities. ATARI [
1] assesses the focusability of screen elements by navigating sequentially using TalkBack. However, ATARI does not consider any actions, e.g., clicking, and depends on the developers/testers to provide the screen. Moreover, both Latte and ATARI consider one type of navigation in TalkBack: linear navigation. Groundhog [
60] addresses this limitation by using an app crawler to visit multiple screens and assessing whether TalkBack can click the elements. Although Groundhog performs actions, its analysis is limited to a single action on a single screen and cannot detect accessibility issues that arise over a sequence of interactions. Moreover, it supports only one action, i.e., click, and no other actions such as swipe or type. Finally, Groundhog cannot visit and analyze various parts of an app due to limitations in random input generation, e.g., it cannot get past a login screen.
The key insights that guide our research are (1) mobile developers and testers still prefer manual testing in app development [
38,
41,
47], (2) assistive services need to be incorporated for evaluating apps’ accessibility, and (3) there is a lack of expertise and knowledge among many mobile developers and testers on how to properly evaluate the accessibility of their apps with guidelines, automated tools, and assistive services. A survey found that 48% of Android developers cite lack of awareness as the main reason for accessibility issues in apps [
2]. Another survey found that 45% of accessibility practitioners are experiencing problems related to accessibility development and design, such as inadequate resources and experts [
19].
Informed by the above-mentioned insights, we have developed a new form of automated accessibility analysis, called
A11yPuppetry, that aids developers in gaining insights on accessibility issues of their apps. Developers and testers can evaluate their apps manually by using touch gestures, while
A11yPuppetry records these interactions. After that,
A11yPuppetry interacts with the app on another device using an assistive service to perform the equivalent actions on behalf of the testers, regardless of their knowledge and expertise in accessibility and assistive services.
A11yPuppetry is inspired by Record-and-Replay (RaR) techniques, such as [
31,
33,
57], where a program records the user’s actions on an app and replays the same actions on the same app on another device. However, to the best of our knowledge, all existing RaR techniques replay the recorded actions exactly as they were performed. For example, if the user touches specific coordinates on the screen, the replayer also sends a touch event for the same coordinates.
A11yPuppetry is different from these techniques since the replaying part is completely done by an alternative way of interaction, e.g., a screen reader. More importantly,
A11yPuppetry generates a fully visualized report for developers after replaying the recorded use case with assistive services, augmented with the detected accessibility issues.
This paper makes the following contributions:
•
A novel, high-fidelity, and semi-automated form of accessibility analysis that can be used by almost any mobile developer or tester to evaluate the accessibility of mobile apps with assistive services;
•
A publicly available implementation of the above-mentioned approach for Android called
A11yPuppetry [
59];
•
User studies with users with disabilities and a resulting benchmark of real apps whose accessibility issues are confirmed by those users; and
•
An extensive empirical evaluation demonstrating the effectiveness of A11yPuppetry in identifying issues that the existing automated techniques cannot detect.
The rest of this paper is organized as follows: Section
2 motivates this study with an example and explains the challenges that we are facing. Section
3 examines the related literature. Section
4 then provides an overview of our approach, and the following sections explain its details. The evaluation of
A11yPuppetry on real-world apps is finally presented in Section
9. The paper concludes with a discussion of the avenues for future work.
2 Motivating Example
This section illustrates how users with visual impairments use screen readers to interact with apps. Further, we demonstrate a couple of accessibility issues that cannot be detected by conventional accessibility testing tools. Finally, we elaborate on the challenges of automatically recording touch gestures and replaying them with a screen reader.
Figure
1(a) shows the home page of the Dictionary.com app with more than ten million users in the Android Play store [
25]. Assume a tester wants to validate the correctness of a use case that consists of three parts: selecting the “word of the day” and listening to its pronunciation, marking the word as a favorite, and reviewing or removing favorite words.
A user without a disability who can see all elements on the screen and perform any touch gestures can perform this use case fairly easily. First, she taps on the word of the day, box 10 in Figure
1(a), then the app goes to Figure
1(b). Next, she taps on the speaker button to listen to the pronunciation, pink-dashed box in Figure
1(b). Then to mark the word as a favorite, she taps the star button, yellow-solid box in Figure
1(b), and she can get back to the home page, Figure
1(a), by pressing the back button. Next, to see the list of favorite words, she taps on box 2. The app will go to the state depicted in Figure
1(c). To remove a word, the user needs to tap on the edit button, yellow-solid box in Figure
1(c), then the navigation bar changes to show the number of selected words and the delete button, Figure
1(d). Finally, the user selects the checkbox next to the word, and taps on the delete button, the yellow-solid box in Figure
1(d).
To perform the same use case, users with visual impairments, particularly blind users, have a completely different experience. They rely on screen readers, e.g., TalkBack for Android [
10], to interact with the app. Users can perceive the screen’s content by navigating through elements and listening to TalkBack announce the textual description of the focused element. A common accessibility issue among mobile apps is the lack of content descriptions for visual icons [
2,
23]. For example, if the star button in Figure
1(b) does not have a content description, a blind user cannot guess the functionality of this button. For the sake of this example, assume this app does not have such issues and all elements have proper textual description, e.g., box 2 in Figure
1(a) has a content description as “Favorites List”.
There are several ways of navigating the elements of an app with TalkBack. Using
Linear Navigation, the user can navigate to the next and previous element of the currently focused element by swiping right and left on the screen. For example, to reach the “word of the day” in Figure
1(a), which is
diphthongize, the user can start from box 1 (the top-left icon) and navigate to the next elements until she reaches box 10. Note that TalkBack may group elements for a more fluent announcement, as here, where a couple of textual elements are grouped into box 10. Second, the user can utilize
Jump Navigation to focus on elements with specific types, e.g., buttons or edit-text boxes. For example, by jumping in button elements, the user can focus on boxes 1, 4, 5, 6, and 7, pink-dashed boxes in Figure
1(a). The third way is
Touch Navigation where the user touches different parts of the screen, and TalkBack focuses on the elements behind the user’s finger. For example, if the user touches the top right of the screen in Figure
1(b), TalkBack focuses on box 2 and announces “Favorites List”. Another way is finding the element through a search: a TalkBack user can enter the name of the element she is looking for, either by text entry or by voice command, and TalkBack focuses on the element with the matching text. For example, by searching “View All”, TalkBack focuses on box 9 in Figure
1(a).
Besides these navigation methods for focusing on an element, there are alternative ways to perform touch gestures. For example, the user can replicate the scroll action by swiping on the screen with two fingers. The user can also execute some predefined actions by performing special gestures: swiping up then left is equivalent to going to the device’s home screen, and swiping left then right is equivalent to scrolling backward.
To click on an element, the user should perform a double-tap gesture on the screen when the target element is focused. TalkBack perceives this gesture and sends a click accessibility event,
ACTION_CLICK, to the focused button, which is the equivalent of tapping on the button by touch. After getting to the word of the day page, to listen to the pronunciation, the user needs to locate the speaker button, pink-dashed box in Figure
1(b). However, the element cannot be focused by TalkBack because the developers set the focus only on its ancestor, the
RelativeLayout, telling TalkBack to skip all of its descendants, including the speaker button; therefore, this functionality is inaccessible to TalkBack users. While the unlocatability of this element by TalkBack is a critical accessibility issue, Google’s Accessibility Scanner, the most widely used accessibility analyzer for Android, cannot detect it, since the Scanner does not take assistive services like TalkBack into account.
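For illustration, such an inaccessible configuration can arise from a pattern like the following Kotlin sketch, in which the container carries the announcement while its children are hidden from accessibility services; this is a hypothetical reconstruction of the pattern, not the actual Dictionary.com code.

import android.view.View
import android.view.ViewGroup

// Hypothetical reconstruction: the container is announced as one block while
// every child, including the speaker button, is hidden from accessibility
// services, so TalkBack can never focus the button.
fun groupAndHideChildren(container: ViewGroup) {
    container.contentDescription = "Word of the day card"  // assumed announcement text
    for (i in 0 until container.childCount) {
        container.getChildAt(i).importantForAccessibility =
            View.IMPORTANT_FOR_ACCESSIBILITY_NO
    }
}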
Assuming the mentioned accessibility issue does not exist, the blind user continues the rest of the use case by selecting the favorite button, the yellow-solid box with the star icon in Figure
1(b), and then returns to the home page. After returning to the home page, the user needs to find Favorites List or box 2 in Figure
1(a). However, since the user was previously on this page, box 10 is focused. By navigating to the next elements, boxes 11 and 12, TalkBack automatically scrolls forward to fetch the items below; however, the app makes the upper menu disappear as shown in Figure
1(e). A sighted user can notice this major change on the screen since she can observe all parts of the screen; however, a blind user may not notice it. Consequently, the blind user cannot locate the favorites list button, initially located at the top right of the display. Even if the user searches for the word “Favorite”, Figure
1(f), there is no result since the favorites list button does not exist on the screen anymore. This is another example of accessibility issues that cannot be detected without considering exactly how blind users interact with apps, i.e., through a screen reader such as TalkBack.
While it is straightforward for most app developers and testers without disabilities to perform the aforementioned use case with touch gestures, none of the accessibility issues above can be detected unless the same use case is performed using a screen reader. Our objective is to record touch gestures from an arbitrary app tester, automatically execute them using a screen reader, and generate a report with the detected accessibility issues. We now explain the challenges in realizing this idea.
•
Action Mapping. Although users with visual impairments also use touch gestures with screen readers like TalkBack, the way actions are performed is completely different. For example, as mentioned above, clicking an element without a screen reader is a simple touch on the element’s coordinates; however, a screen reader user needs to first locate the element and then perform a double-tap gesture to initiate a click. There is no trivial mapping between touch gestures and the screen reader’s actions.
•
Action Approximation/Alternatives. Even when a mapping between touch gestures and screen-reader actions exists, the actions are not completely equivalent. For example, sighted users can scroll different parts of the screen with different velocities; however, TalkBack users can only perform four fixed scroll actions, left, right, forward, and backward, whose start/end points and velocity are constant, regardless of what the TalkBack user wants. Alternatively, TalkBack users can scroll through lists by navigating through items via swiping left or right. Either way, although corresponding actions exist with and without screen readers, their effects differ, making it complicated to ensure the apps end up in the same state.
•
Element Identification. Besides the fact that actions are done differently with and without screen readers, the way elements are accessed is also different. As mentioned earlier, TalkBack may group multiple elements into one for a better user experience for visually impaired users. Moreover, if an action is associated with an element of a group, TalkBack assigns the action to the whole group. For example, in Figure
1(d), a sighted user may tap on the checkbox to select the word; however, TalkBack focuses on the group of the checkbox and the word (pink-dashed box), not the checkbox itself. Therefore, to select the checkbox, a TalkBack user needs to focus on the group of elements and then perform a double-tap gesture.
•
Lack of Accessibility Knowledge. In traditional record-and-replay techniques, a tester can easily identify bugs and issues since the replaying is supposed to be identical to the recording, and all interactions are familiar to the tester. However, it is not trivial for a sighted user to understand accessibility issues in a screen reader’s replay if she is not an experienced assistive-service user. That is why it is important not only to detect accessibility issues, but also to provide an explanation for developers as to how the detected issues hinder visually impaired users.
4 Approach Overview
A11yPuppetry consists of four main phases: (1) Record, (2) Action Translation, (3) Replay, and (4) Report. In this section, we provide an overview of the approach; the next four sections explain the details of each phase.
Figure
2 depicts an overview of
A11yPuppetry. The process starts with the Record phase, when the user interacts with a device on which the Recorder service is enabled. The Recorder service listens to UI change events and adds a transparent GUI widget overlay on top of the screen to record the user’s touch gestures. After receiving a touch gesture on the overlay, the Recorder replicates the gesture on the underlying app and sends the recorded information to the server as an
Action Execution Report. The server stores the recorded information in the database.
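A minimal sketch of how such a recorder can be structured as an Android AccessibilityService with a transparent overlay is shown below; apart from the Android APIs, the class and helper names (RecorderService, forwardToServer) are our assumptions rather than A11yPuppetry’s actual code.

import android.accessibilityservice.AccessibilityService
import android.graphics.PixelFormat
import android.view.MotionEvent
import android.view.View
import android.view.WindowManager
import android.view.accessibility.AccessibilityEvent

// Sketch of a recorder: a transparent accessibility overlay observes raw touch
// events, while accessibility events report the UI changes they cause.
class RecorderService : AccessibilityService() {

    override fun onServiceConnected() {
        val wm = getSystemService(WINDOW_SERVICE) as WindowManager
        val overlay = View(this)
        val params = WindowManager.LayoutParams(
            WindowManager.LayoutParams.MATCH_PARENT,
            WindowManager.LayoutParams.MATCH_PARENT,
            WindowManager.LayoutParams.TYPE_ACCESSIBILITY_OVERLAY,
            WindowManager.LayoutParams.FLAG_NOT_FOCUSABLE,
            PixelFormat.TRANSLUCENT
        )
        overlay.setOnTouchListener { _, event: MotionEvent ->
            forwardToServer(event)  // hypothetical: package the gesture into an Action Execution Report
            // The real Recorder then replicates the gesture on the app underneath
            // (e.g., via AccessibilityService.dispatchGesture), which we omit here.
            true
        }
        wm.addView(overlay, params)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {
        // UI-change events (content changed, clicks, scrolls) would be recorded here.
    }

    override fun onInterrupt() {}

    private fun forwardToServer(event: MotionEvent) { /* assumed network call */ }
}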
In the second phase, Action Translation, the
Action Translator component receives the
Action Execution Report from the Recorder (containing the UI hierarchy, a screenshot, and the performed gesture) and translates it into its equivalent
TalkBack Action. For example, touching on the coordinates of the favorite button in Figure
1(b) will be translated to focusing on the favorite button and performing a double-tap gesture.
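As a rough sketch of this translation step, a recorded tap at screen coordinates can be mapped to “focus the node under the tap, then double-tap”; the TalkBackAction type and helper functions below are assumptions of ours for illustration.

import android.graphics.Rect
import android.view.accessibility.AccessibilityNodeInfo

// Assumed representation of a translated action (names are ours).
sealed class TalkBackAction {
    data class ElementBased(val targetDescription: String, val gesture: String) : TalkBackAction()
    data class TouchGestureReplication(val path: List<Pair<Float, Float>>) : TalkBackAction()
    data class PredefinedAction(val name: String) : TalkBackAction()
}

// Sketch: map a recorded tap at (x, y) to an element-based TalkBack action.
fun translateTap(root: AccessibilityNodeInfo, x: Int, y: Int): TalkBackAction? {
    val target = findNodeUnder(root, x, y) ?: return null
    val label = target.contentDescription ?: target.text ?: return null
    return TalkBackAction.ElementBased(label.toString(), gesture = "double-tap")
}

// Depth-first search for the deepest node whose screen bounds contain the point.
fun findNodeUnder(node: AccessibilityNodeInfo, x: Int, y: Int): AccessibilityNodeInfo? {
    val bounds = Rect()
    node.getBoundsInScreen(bounds)
    if (!bounds.contains(x, y)) return null
    for (i in 0 until node.childCount) {
        val child = node.getChild(i) ?: continue
        findNodeUnder(child, x, y)?.let { return it }
    }
    return node
}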
In the Replay phase, the TalkBack Action is sent to several replayer devices that perform the action. Each replayer device has a running TB Replayer service that receives the TalkBack Action from the server, creates and maintains a
TalkBack Element Navigation Graph (
TENG) of the app, and performs the received actions with a navigation mode. We will define and explain TENG and navigation modes in Sections
6 and
7 in detail; however, for now, assume TENG is a model of the app UI designed for TalkBack, and a navigation mode is a way of locating elements, e.g., Linear or Jump Navigation. Once an action is performed, a TalkBack Execution Report is stored in the database. The TalkBack Execution Report consists of the actions executed with TalkBack, along with screenshots and UI hierarchy files of the app’s states before, during, and after execution.
In the final phase (Report), the A11y Analyzer component reads the stored information in the database, i.e., Action and TalkBack Execution Reports, and produces an Aggregated Report of the recording, replaying, and the detected accessibility issues. The user can access this report using a web application.
7 Replayer
The third phase of A11yPuppetry replays the received TalkBack Action with TalkBack. Before the user starts interacting with the app, the recorder and replayer devices are in the same state, i.e., the app under test is installed and opened. On the replayer device, the TalkBack and TB Replayer services are enabled. TB Replayer is an AccessibilityService, similar to the Recorder service, that is responsible for communicating with TalkBack to perform the received action. For each navigation mode, i.e., Linear, Jump, Search, and Touch, there is one replayer device receiving inputs from the server.
Recall that a TalkBack Action can be ElementBased (\(\mathbb {EB}\)), TouchGestureReplication (\(\mathbb {TGR}\)), or PredefinedAction (\(\mathbb {PA}\)). To perform \(\mathbb {TGR}(lg)\), TB Replayer makes a copy of the LineGesture lg, called lg′, moves its coordinates 2cm toward the top or right of the display, then combines the two LineGestures (lg and lg′) and performs them together while TalkBack is enabled. Performing a \(\mathbb {PA}(t)\) is easier since it is predefined and does not depend on the app. TB Replayer has a database of PredefinedActions and performs the corresponding action, e.g., swipe right then left when t is “Scroll Forward”.
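A minimal sketch of dispatching such a combined two-stroke gesture from an AccessibilityService is shown below; the helper name and the simplified offset handling (always toward the top) are our assumptions.

import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.util.TypedValue

// Sketch: replicate a recorded one-finger line gesture while TalkBack is on by
// adding a second, parallel stroke offset by roughly 2 cm, so TalkBack treats
// the pair as a two-finger (pass-through) gesture rather than a navigation command.
fun AccessibilityService.replicateLineGesture(
    startX: Float, startY: Float, endX: Float, endY: Float, durationMs: Long = 300
) {
    val dm = resources.displayMetrics
    val offsetPx = TypedValue.applyDimension(TypedValue.COMPLEX_UNIT_MM, 20f, dm)

    val original = Path().apply { moveTo(startX, startY); lineTo(endX, endY) }
    // Copy of the gesture moved toward the top; a real implementation would pick
    // the top or the right depending on the available room on the screen.
    val shifted = Path().apply {
        moveTo(startX, startY - offsetPx); lineTo(endX, endY - offsetPx)
    }

    val gesture = GestureDescription.Builder()
        .addStroke(GestureDescription.StrokeDescription(original, 0, durationMs))
        .addStroke(GestureDescription.StrokeDescription(shifted, 0, durationMs))
        .build()
    dispatchGesture(gesture, null, null)  // result callback and handler omitted
}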
However, performing an ElementBased action is relatively challenging since it requires finding and focusing on the element first. Moreover, there are various ways of navigating to locate an element, i.e., Linear, Jump, Search, and Touch. To that end, we introduce TENG (TalkBack Element Navigation Graph) to model the different ways of navigating an app with TalkBack. After that, we define different strategies to guide TB Replayer on traversing the TENG of the app.
7.1 TENG
Simply put, TENG is a graph modeling the different states of TalkBack when enabled. TENG is defined over the UI hierarchy of an app screen, where the nodes include GUI elements that can be focused by TalkBack and the edges represent actions that can be done by the user (or TB Replayer) to change the focus from one node to another. For example, Figure
3(a) represents a part of the TENG of the app screen in Figure
1(d). For now, please ignore the Start and End red boxes; we will define and explain them shortly. The blue ovals represent control elements, e.g., buttons or checkboxes, and the green round boxes represent textual elements. The gray boxes are View elements containing a set of elements that TalkBack groups into a single announcement. Recall that in Section
2, we discussed how TalkBack groups related elements and associates the group with an action for a better user experience. At runtime, when TalkBack is in any of these nodes (states), i.e., focused on the corresponding element, we call it an
active node. The solid arrows in Figure
3(a) represent Linear Navigation between elements, e.g., the red arrows are associated with swiping right, i.e., moving to the next element. The dotted arrows represent Jump Navigation, which changes the active node to another control element. For example, if the Delete node is active, swiping right makes TalkBack focus on the text element that starts with “Favorite”, while swiping down makes TalkBack jump to the previous control element, which is “Back”.
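A data-structure sketch of TENG in this spirit could look as follows; the Kotlin type names are ours, not A11yPuppetry’s.

import android.view.accessibility.AccessibilityNodeInfo

// Assumed TENG representation: nodes are TalkBack-focusable elements or virtual
// states, and each directed edge is labeled with the user gesture that moves the
// accessibility focus from its source to its target.
enum class NavMode { LINEAR, JUMP, SEARCH, TOUCH }

sealed class TengNode {
    data class Element(val node: AccessibilityNodeInfo) : TengNode()
    data class Virtual(val name: String) : TengNode()  // e.g., "Start" or "End"
}

data class TengEdge(
    val from: TengNode,
    val to: TengNode,
    val mode: NavMode,
    val gesture: String  // e.g., "swipe right", "swipe down", "three-finger tap"
)

class Teng(
    val nodes: MutableList<TengNode> = mutableListOf(),
    val edges: MutableList<TengEdge> = mutableListOf()
) {
    var active: TengNode? = null  // node currently focused by TalkBack

    fun outgoing(mode: NavMode): List<TengEdge> =
        edges.filter { it.from == active && it.mode == mode }
}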
Besides the UI elements, TENG has some other nodes which we call
Virtual States. These states do not correspond to an element on the screen; however, they represent some internal states of TalkBack. For example, the virtual states
Start and
End in Figure
3(a), represent the states where TalkBack reaches the first or last element on the screen and notifies the user that there is no element left to visit. Note that the user can still change the focus to other elements by Linear or Jump Navigation, even if TalkBack is in a virtual state, e.g., swiping left from Start changes the focus to the compound element at the end.
Recall that TalkBack supports two other navigation modes, i.e., Search and Touch. We model these navigations in TENG using virtual states. Figure
3(b) shows the part of TENG related to Search Navigation. The entry edge is a representative edge that comes from all nodes in TENG and is associated with a three-finger tap; we did not draw all of these edges to avoid cluttering the figure. Once the Search Screen is activated, the user can type the text she is looking for, and the results appear in a list (Result Screen). Once the user selects a search entry, TalkBack focuses on the selected element. Finally, Touch Navigation is modeled as depicted in Figure
3(c). Whenever the user taps somewhere on the screen, TalkBack finds the underlying element and focuses on it. Similar to Search Navigation in Figure
3(b), the entry edge of the Touch State comes from all nodes of TENG.
Given a target element, we can use TENG to plan a sequence of interactions with the device to focus on the element. For example, similar to the last step of our motivating example in Section
2, assume we want to click on the checkbox and that, at the beginning, TalkBack is focused on the Back button. Therefore, TENG’s active node is the Back button in Figure
3(a), and the goal is to focus on the TENG node containing the target element (the compound element denoted by the gray box) and then perform a double-tap. There are various ways to reach the target node; for instance, by performing two swipe-up actions, TalkBack first jumps to the Delete button and then to the target node.
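Planning such a sequence amounts to a shortest-path search over TENG restricted to the edges of one navigation mode; a minimal breadth-first-search sketch, building on the assumed types above, could look like this:

import java.util.ArrayDeque

// Sketch: BFS over TENG using only edges of one navigation mode, returning the
// gesture sequence that moves the focus from `start` to `target` (or null).
fun planPath(teng: Teng, start: TengNode, target: TengNode, mode: NavMode): List<TengEdge>? {
    val queue = ArrayDeque<TengNode>().apply { add(start) }
    val parentEdge = mutableMapOf<TengNode, TengEdge>()
    val visited = mutableSetOf(start)
    while (queue.isNotEmpty()) {
        val current = queue.poll()
        if (current == target) {
            // Reconstruct the edge sequence by walking parents back to the start.
            val path = mutableListOf<TengEdge>()
            var node: TengNode = current
            while (node != start) {
                val edge = parentEdge.getValue(node)
                path.add(edge)
                node = edge.from
            }
            return path.reversed()
        }
        for (edge in teng.edges.filter { it.from == current && it.mode == mode }) {
            if (visited.add(edge.to)) {
                parentEdge[edge.to] = edge
                queue.add(edge.to)
            }
        }
    }
    return null  // target not reachable with this navigation mode
}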
However, traversing with TalkBack is not as easy as it sounds. There are three reasons TENG may change during the interaction with TalkBack. First, the app may dynamically update the visible elements on the screen; for example, a slide show constantly changes the visible content after showing it for a specific amount of time. Second, TalkBack may change the app state by performing extra gestures for navigation; for instance, recall that in the motivating example TalkBack scrolled the page once it reached the last element visible on the screen, Figure
1(e). Lastly, the app may change the focused element at runtime. For example, if developers do not want users to access certain elements, regardless of the rationale behind this decision, they can move the focus to another element as soon as that element is focused by TalkBack. Therefore, we cannot rely solely on the TENG created from the UI hierarchy before navigation.
To that end, once TB Replayer performs an action associated with an edge, e.g., swiping right to focus on the next element, the service listens for any changes in the UI to determine whether the UI hierarchy has changed. If anything changes, TB Replayer recreates the TENG and continues the navigation. Otherwise, the service verifies that the current active node in TENG is focused by TalkBack. If it is not, we mark the performed edge as ineffective and replan the locating path.
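Putting the pieces together, the locate-and-verify loop could be sketched as follows; TbReplayer, MAX_ATTEMPTS, and the handling of a rebuilt graph are simplifying assumptions of ours.

const val MAX_ATTEMPTS = 50  // assumed bound to avoid looping forever

interface TbReplayer {  // assumed wrapper around the replayer AccessibilityService
    fun performGesture(gesture: String)
    fun uiChangedSinceLastGesture(): Boolean
    fun currentFocus(): TengNode?
    fun rebuildTeng(): Teng
}

// Sketch: after each navigation gesture, remodel if the UI changed, or replan if
// the expected node did not receive the focus (matching the target element in a
// rebuilt graph is omitted for brevity).
fun locate(initial: Teng, target: TengNode, mode: NavMode, replayer: TbReplayer): Boolean {
    var teng = initial
    var attempts = 0
    while (teng.active != target && attempts < MAX_ATTEMPTS) {
        val path = planPath(teng, teng.active ?: return false, target, mode) ?: return false
        val edge = path.first()
        replayer.performGesture(edge.gesture)       // e.g., swipe right
        when {
            replayer.uiChangedSinceLastGesture() ->
                teng = replayer.rebuildTeng()       // app content changed: remodel and retry
            replayer.currentFocus() != edge.to ->
                teng.edges.remove(edge)             // mark the edge as ineffective
            else ->
                teng.active = edge.to               // focus moved as expected
        }
        attempts++
    }
    return teng.active == target
}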
7.2 Implementation
TB Replayer is an implementation of
AccessibilityService. It builds the UI hierarchy by analyzing all visible
AccessibilityNodeInfo objects on the screen. Then, using the utility library provided by TalkBack [
12], TB Replayer creates the TENG from the UI hierarchy. This library provides helper methods to determine which elements can be focused by TalkBack and the linear order among them. The virtual states in TENG are created and maintained by the TB Replayer service.
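For illustration, gathering the candidate nodes could look like the sketch below; the focusability heuristic is only a rough stand-in for the TalkBack utility library’s actual logic.

import android.view.accessibility.AccessibilityNodeInfo

// Sketch: collect visible nodes from the active window. Deciding which of them
// TalkBack can actually focus, and in what linear order, is delegated to the
// TalkBack utility library; the heuristic below only approximates it.
fun collectFocusCandidates(root: AccessibilityNodeInfo): List<AccessibilityNodeInfo> {
    val result = mutableListOf<AccessibilityNodeInfo>()
    fun visit(node: AccessibilityNodeInfo) {
        if (!node.isVisibleToUser) return
        val speakable = !node.text.isNullOrEmpty() || !node.contentDescription.isNullOrEmpty()
        if (speakable || node.isClickable || node.isFocusable) result.add(node)
        for (i in 0 until node.childCount) {
            node.getChild(i)?.let { visit(it) }
        }
    }
    visit(root)
    return result
}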
Each TB Replayer in a device is responsible for one navigation mode, e.g., Linear or Jump. To locate an element, the TB Replayer only uses the edges in TENG that belong to its navigation mode. For example, to navigate to the checkbox element from the back element in Figure
3(a), the TB Replayer for Jump Navigation only uses the dotted arrows, and the one for Search Navigation only uses the edges in Figure
3(b). Once the element is located, the TB Replayer performs the desired action, e.g., a double-tap for a click or a double-tap-and-hold for a long-press.
TB Replayer compiles a set of information and sends it to the server, including the UI hierarchy, screenshot, TENG, and performed actions in all stages.
9 User Studies
This section explains our experiments and user studies to evaluate the effectiveness and limitations of A11yPuppetry.
We selected five Android apps with possible accessibility issues reported in the literature [
60] or online social media [
39]. For each app, we designed a task (consisting of 21 to 33 actions) according to the app’s functionality, and we included in each task the parts of the app that were reported to be inaccessible. The first four columns of Table
1 show some information about the subject apps and the number of actions involved in the designed tasks.
We used A11yPuppetry on each task of these five apps. We used an Android emulator with Android 11 and TalkBack (version 12.1) for both the recording and replaying devices. Our prototype of A11yPuppetry enables us to perform the experiments synchronously (recorder and replayers running simultaneously) or asynchronously (the recording done before the replaying). For the experiments, we used the asynchronous mode to avoid problems caused by the network or other concurrency issues; in practice, however, the synchronous mode is more promising since the results can be obtained much faster.
To compare A11yPuppetry with existing work, we used Latte and Accessibility Scanner. Since Latte requires GUI test cases for its analysis, we transformed the recorded use cases into GUI test cases. Scanner is not a use-case-driven tool and scans the whole screen; therefore, we ran Scanner on the app’s screens after each interaction. Moreover, since these experiments focus on blind users who use TalkBack, we filtered out issues that are not related to blind users, such as small touch target size or low text contrast.
Besides experiments with these tools, we conducted two user studies with users with visual impairments who have experience working with TalkBack on Android. To connect with such users, we used the third-party service Fable.
Fable is a company that connects tech companies to users with disabilities for user research and accessibility testing. Fable compensates all user testers and is committed to fair pay for the testers.
We used two services of Fable: Compatibility Test and User Interview. In the compatibility test, we provided the designed tasks and apps to Fable, and Fable distributed each task to three users with visual impairments. For all the users who participated in the compatibility tests, the assistive technology used was TalkBack (the screen reader on Android). The users performed the tasks and, for each step of a task, reported any issues they faced. Once we gathered all of the detected issues from A11yPuppetry and the compatibility tests on Fable, we conducted a preliminary analysis and produced a comprehensive list of accessibility issues for each step. Then, for each app, we requested a user interview through Fable, which scheduled a one-hour online interview with a blind user who uses TalkBack. During the interview, the user shared his/her Android phone screen. We asked the users to perform the designed tasks and explain their thoughts and understanding of the app’s pages. When they faced an accessibility issue that prevented them from continuing the task, we intervened and guided them to skip to the next step. Once the users finished the tasks, we started a conversation and asked them specific questions about the tasks or general questions about their experience working with screen readers and apps. In summary, each app was assessed four times: by three users in compatibility tests and one user in an online interview.
The source code of
A11yPuppetry, a demo of the web interface, the designed tasks, apps, and user responses can be found on our companion website [
59]. The designed tasks can also be found in
Appendix A.
We would like to understand how
A11yPuppetry can help detect accessibility issues confirmed by users with visual impairment. As discussed before, all five tasks from five subject apps are assessed by users with disabilities, Accessibility Scanner, Latte [
58], and
A11yPuppetry. For
A11yPuppetry, we used four navigation modes (Linear, Touch, Jump, and Search). For user feedback, if at least one user expresses an issue with a certain action, we assume the action has an accessibility issue. The number of reported issues for each app can be found in Table
1. The last column (Total) represents the number of actions for which at least one of the navigation modes in
A11yPuppetry reported an issue. As can be seen, the issues detected by Latte and
A11yPuppetry are proportional to the number of actions; however, Scanner reported many issues that can be difficult for testers to examine and verify.
Table
2 summarizes the effectiveness of Scanner, Latte, and
A11yPuppetry in detecting issues confirmed by actual users. For each tool, we calculate the number of user-confirmed problems that the tool could automatically detect. The key insight for designing
A11yPuppetry was that a human tester interacts with it and interprets the results to locate accessibility issues that could require human knowledge to detect. Therefore, for
A11yPuppetry, we also calculate the number of user-confirmed issues for which evidence of the same issues exists in the report of
A11yPuppetry. Table
2 shows the results obtained for each tool in comparison to the user-confirmed issues. As can be seen, even the automatically detected results of
A11yPuppetry outperform the existing tools. On average,
A11yPuppetry could detect more than 70% of issues confirmed by users.
Results from Table
2 indicate that
A11yPuppetry outperforms the two existing accessibility checkers, Latte and Accessibility Scanner. We further summarize which issues the existing accessibility checkers can and cannot detect.
Accessibility Scanner is a dynamic accessibility testing tool on Google Play Store that provides accessibility suggestions based on scanned screens [
5]. Developers can scan either a single screen or a series of snapshots through a recording, and Accessibility Scanner provides them with the results of the scan. According to its official documentation, Accessibility Scanner reports four types of accessibility issues.
•
Content Labeling. Issues related to the content labels, such as missing labels, unclear and uninformative link text, and duplicate descriptions.
•
Implementation. Issues inside View hierarchies that might hinder people with motor disabilities from interacting with a layout, such as duplicate clickable views that share the exact screen location, unsupported item types for Android Accessibility Service, traversal orders, and text scaling.
•
Touch Target Size. Identifies the small touch elements. The threshold of the element size can be adjusted in Accessibility Scanner settings.
•
Low Contrast. Identifies elements with a low contrast ratio between text and background or between background and foreground. Similar to touch target size, the threshold of the contrast ratio can be adjusted in Accessibility Scanner settings.
As Accessibility Scanner does not incorporate any assistive service during the evaluation of apps, it cannot detect issues related to unfocusable elements and ineffective actions, nor can it provide evidence for difficulties in reading, as A11yPuppetry does. A detailed description of these issue types is provided later in this section.
Latte [
58] relies on the availability of GUI test cases for detecting accessibility issues in Android and only supports TalkBack’s linear navigation for locating an element. Therefore, Latte can only detect unfocusable elements and ineffective actions related to linear navigation. Latte can provide developers with no evidence about either
uninformative textual descriptions or
difficulties in reading.
To better understand the detected issues, we manually analyzed all reported issues and categorized them into five categories: (1) Automated Detection, issues that both users and A11yPuppetry reported; (2) Evidence Provided, issues that users reported and for which A11yPuppetry’s report provides evidence that can guide a tester to detect the issue; (3) Unsettled Issues, which A11yPuppetry reported but users did not find significant; (4) Flaky Issues, which A11yPuppetry mistakenly reported as issues; and (5) Undetected Issues, which users reported but for which A11yPuppetry did not provide any evidence. In the following, we explain the subcategories of each of these categories and provide illustrative examples.
9.1 Automated Detection
Missing Speakable Text. This issue (a visual element without a content description) is among the most common types of accessibility issues in mobile apps [
23]. Due to the nature of this issue, existing accessibility testing techniques, such as Accessibility Scanner, can detect it by analyzing only the layout of the app, without considering assistive services.
A11yPuppetry detects such issues using Search navigation, i.e., if an element is not associated with a textual description, it cannot be found through TalkBack’s search.
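For example, a simplified check in this spirit (a stand-in for the Search-based detection, with a name of our choosing) could be:

import android.view.accessibility.AccessibilityNodeInfo

// Simplified check: a clickable node with neither text nor content description
// has nothing TalkBack could announce or match in a search.
fun lacksSpeakableText(node: AccessibilityNodeInfo): Boolean =
    node.isClickable &&
        node.text.isNullOrEmpty() &&
        node.contentDescription.isNullOrEmpty()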
Unfocusable Element. Here, an element associated with a functionality or certain data cannot be focused by TalkBack; as a result, TalkBack users cannot access it or even realize that such an element exists. In Section
2, we gave an example of such an issue (the speaker button in Figure
1(b)). Note that this issue cannot be detected by Accessibility Scanner, since detecting it requires assessing at runtime whether the element is focusable by TalkBack.
Sometimes the unfocusable element belongs to a minor feature that the user may not need. For example, the collapse button in the iSaveMoney app that hides the details of expenses (red-dashed box in Figure
5(a)). Sometimes, however, this issue becomes critical. For example, on one of the search pages of Expedia, none of the elements on the screen, including the Navigate Up button, are focusable, leaving the user confused. A user mentioned: “After typing New York and pressing the search button, I am unable to move around the screen at all. None of the gestures that I use to navigate or read the screen work.”
Ineffective Action. Sometimes elements can be focused by TalkBack, but the intended action cannot be performed. For example, in the iSaveMoney app, many buttons, including all yellow-solid boxes in Figure
5(a), can be focused by TalkBack. However, after performing a click action by double-tapping, nothing happens. The underlying reason appears to be the customized implementation of the button, which responds to touch gestures but not to click accessibility actions. The same issue is found in Doordash when the user wants to change the delivery option to pick-up.
9.2 Evidence Provided
The following issues were reported by users but not detected automatically by A11yPuppetry. However, the aggregated report of A11yPuppetry, including the annotated video and blindfold mode, provides evidence of these issues. This report can help accessibility testers find such issues faster, without the need to interact with an app multiple times.
Uninformative Textual Description. The main purpose of a content description is to help users with visual impairments understand the app better; as a result, merely having a content description does not by itself improve accessibility.
A11yPuppetry is not capable of analyzing the semantics of content descriptions; however, its blindfold mode lists the texts that are announced while exploring the app. A developer/tester can determine whether the textual descriptions are informative by reading the blindfold-mode report. An example of the blindfold mode can be found in Figure
4(e). Here are some examples of this type of issue confirmed by users.
•
The textual element contains random or irrelevant data. For example, the notification icon in ESPN, the highlighted button in Figure
5(b), has the content description “Í”, which is not informative.
•
The elements associated with a functionality, e.g., a button, checklist, or tab, should express that functionality. While TalkBack takes care of standard elements like android.widget.Button, it does not announce the functionality of non-standard elements, e.g., a button that is actually an android.widget.TextView. The Doordash app has many of these issues, e.g., “Save” announced without “button” or “Pickup/Delivery” announced without “toggle”.
•
The textual description should describe the purpose of the element completely. For example, on the renting page of Expedia, there is a compound element described as “Pick-Up”; however, it is unclear whether it refers to the location or the date. A sighted user can easily tell by looking at the pin icon inside this element, which hints that it refers to the pick-up location.
•
Sometimes the textual descriptions provide complete information but are incorrect. For example, the travelers element, highlighted in Figure
5(c), clearly shows there are 3 travelers selected, but its textual description is “Number of travelers. Button. Opens dialog. 1 traveler”, which is incorrect.
Difficulties in Reading. Besides the textual description of elements, the way the texts are announced by TalkBack is important for understanding an app. We found a few accessibility issues reported by the users that make it difficult for them to perceive the text. This kind of issue can be detected by testers by manually analyzing the
annotated videos and
blindfold mode. Examples of the annotated replayer video and blindfold mode can be found in Figures
4(b) and
4(e), respectively. For example, in Dictionary, paragraphs of text cannot be read as a whole; the user has to read a long text word by word. In the Doordash app, yellow-solid boxes in Figure
6(a), each category on the main page is announced twice: once for the visible text, e.g., “Grocery” or “Chicken”, and once for the image, which does not have a textual description and is announced as “unlabeled”. In another example, all the textual content of the summary block in the iSaveMoney app, the green-dotted box in Figure
5(a), is announced altogether in an unintuitive order, and the user had to change the reading mode to understand each word. Although these issues do not make the app incomprehensible, they create barriers to blind users. We asked one of the interviewees how they felt about this kind of inaccessibility, and he said he could deal with them “but we, blind people or deaf people, deserved the same amount of dignity as others.”
9.3 Unsettled Issues
A11yPuppetry detected some issues that the users in our user study did not find significant. These issues mostly belong to the Jump and Search navigation modes. In the Jump navigation mode, TB Replayer tries to locate the element using jump navigation (going to the next control or heading element); however, sometimes it is not possible to reach the element since it does not have the proper attributes, e.g., it is not a button. TB Replayer with the Search navigation mode tries to locate elements by searching for their textual description; however, when there are multiple elements with the same description, this mode cannot locate the element correctly. Although users mentioned it would be nice if the attributes were set properly so they could use different navigation modes, they did not find these issues important since they usually do not use the Jump and Search navigation modes. We further examine why users do not use these modes that often in Section
10.
9.4 Flaky Issues
Sometimes
A11yPuppetry reports issues that are not correct, which is caused by technical problems with the experiments. The main characteristic of this category is that, by rerunning
A11yPuppetry, the issue may not be reported again. There are three main technical problems. First, TalkBack sometimes freezes and does not respond properly or in time, making
A11yPuppetry conclude that the app has accessibility issues that prevent TalkBack from continuing the exploration. Second, the recorder may record an incorrect element; for example, on the signup page of the ESPN app, instead of recording a button, it records a transparent view covering the button, which does not interfere with the touch interaction. Lastly, the app may change and be in a different state on the TB Replayer devices. This issue is mainly caused by A/B testing, where developers dynamically show different pages to different users to measure some metrics about their product. For example, Figures
5(d) and (e) are two different fragments of changing the number of travelers in the Expedia app. If the recorder records the action in Figure
5(d), the same element cannot be found in Figure
5(e) since the structure is totally different.
9.5 Undetected Issues
As expected,
A11yPuppetry cannot detect all forms of accessibility problems, and the best way to evaluate the accessibility of apps is by conducting user studies with disabled users. We categorized the limitations of
A11yPuppetry into the following categories.
Improper Change Announcement. As users interact with mobile apps, the layout constantly changes. A sighted user can monitor all of these changes to understand the latest state of the app, while it is much more difficult for users with visual impairments to realize that something in the app has changed. During our interviews, users reported a couple of issues of this kind. For example, when the user presses the search tab in the Doordash app, the red-dashed box in Figure
6(a), a completely new search page appears without any announcement for TalkBack users. One participant mentioned “My preference is that whenever something like that happens, [TalkBack] moves the focus up to where the new content begins because someone as a screen reader won’t necessarily [realize the app is changed].”
Excessive Announcement. On the other hand, it can be problematic and annoying when TalkBack announces more content than a user needs. For example, in the Expedia app, when a user types a name in the search edit box, TalkBack interrupts the user by announcing “Suggestions are being loaded below”. Although it is informative for users to know that search results are loaded on the fly, the constant interruptions are annoying.
Temporary Visible Elements. Sometimes apps introduce new elements for a short period to notify the user that something has changed and to let the user undo or do something relevant to this change. For example, in the Doordash app, when the user saves a restaurant as her favorite, a pop-up box appears, Figure
6(b), notifying the user that the store is saved, and then disappears after a moment. A blind user is informed of this change but does not have enough time to focus on the appearing dialog box.
10 Discussion
The previous section demonstrates the effectiveness of A11yPuppetry in providing insights and detecting accessibility issues. This section discusses other findings from the user studies that might be insightful for future research work.
TalkBack Interaction Preferences. We further examined how users with visual impairments interact with apps using TalkBack. We asked the interviewees to explain the different ways they use TalkBack. If they did not mention any of the navigation methods that we found in the TalkBack documentation, we asked whether they were aware of them.
Generally, the primary navigation mode for all participants is Linear navigation. A user mentioned, “I’m more into the flick, element to element, to explore an app and understand its layout.” This mode is used especially when the user interacts with an unfamiliar app or page.
The next most-preferred navigation mode is Touch; however, it is usually used only in certain scenarios. For example, when a user knows the likely location of an element, she tends to use Touch navigation. One participant mentioned, “The back buttons are always at the top left, usually so... I’m going to put my finger at the top left to find that back button.” Also, when a user cannot find an element or is stuck in a loop, she is more likely to use touch to find the target element.
Some interviewees said they might use Jump navigation for headings in apps that they are familiar with. One participant said, “If I don’t know [the app] well enough... I’m going to flick through the whole thing to figure out the layout. If I know it well enough, then I probably would switch to the heading option and then search by heading.” However, almost none of the participants were willing to use the Search navigation mode. One user mentioned, “I know [search] is there. But I prefer to just hunt for [the elements]. It gives me a more experience with the app.”
We also realized that users avoid other actions like scrolling, since scrolling makes it harder for them to understand the new state of the app. A user said, “[I use scrolling] if I know an app really well. But sometimes I find that when I do the scrolling thing, it’ll get me into something else... sometimes it’ll get me where I really don’t want to be. So I have a tendency not to want to do it.”
Context. A common accessibility issue in mobile apps is missing speakable text [
2,
23]. Although missing speakable text degrades the user experience and ability to locate elements, sometimes users can infer the functionality of an unlabeled button given its context. For example, the user can view the list of saved stores in Doordash and remove any of them, as depicted in Figure
6(b). The element for removing a store is a heart-shaped icon without a content description. However, our interviewee did not have a problem locating this button. He mentioned, “That is a good layout, an accessible checkbox next to [the restaurant], which is checked unchecked. I have seen these checkboxes on the home screen. I don’t like them on the home screen because the user doesn’t know what that checkbox actually does. The common sense here would tell you I’m in the saved stores’ section. So if I uncheck a box, it’s going to remove that.” This observation should not encourage developers to neglect missing content descriptions; on the contrary, it emphasizes the importance of context for users with visual impairments in understanding an app.
Advertisement. In our experiments with
A11yPuppetry, we did not observe any ads. However, if an interstitial ad appears during the replay process,
A11yPuppetry may fail to continue, as the appearance of ads is random and irregular. For example, in the Dictionary app, an interstitial ad, such as the one in Figure
6(c), might appear when the user searches for a word. Disabled users have difficulty noticing the occurrence of ads until they get stuck in the ad window for a few minutes. Even if they are aware of the ads, closing them and returning to the previously interrupted use case is challenging. One of the interviewees tried to locate the ad’s close button with the Linear and Touch navigation modes, but it was not focusable by TalkBack. As a result, the user had to restart the app (close and open it again) to continue the task.
All the interviewees are cautious about in-app advertisements. As one stated, “I tend not to open [the in-app advertisements] because half of the time, these advertisements cause problems.” In addition, most interviewees expressed a willingness to pay for an ad-free version if the price is not too high, so they do not have to deal with ads while navigating apps. A user mentioned: “If the app gives me the option to do without ads with a small price, I pay the small price just so I don’t have to deal with the ads. Most of the time [the ads] don’t work with the screen readers.” Nevertheless, previous research indicates that some apps still contain ads even if users pay the ad-free fee [
32].
To the best of our knowledge, only one previous research investigated the impact of ads on disabled users. The research found that most ads are represented in GIFs, and more than half of the sampled ads have no ALT tag [
64]. Therefore, screen readers cannot read the contents of the ads to blind users. Other researchers investigated the impact of ads on the whole user group, not just disabled users. The negative influences of ads include privacy threats, significant battery consumption, slowing down the app, and disabling an app’s normal function [
29,
32]. We believe that this negative impact is further magnified for disabled users.
There are some design implications for in-app advertisements. Generally, ads that take over the entire screen are called interstitial ads, while ads presented as horizontal strips are called banner ads. Ads should be announced correctly via assistive services so that disabled users are aware of their occurrence. In addition, developers are encouraged to use banner ads, since they usually do not disable an app’s functionality. By contrast, interstitial ads significantly draw users’ attention and even require users to close the ad manually [
32].
Guided Navigation. The interviewees enjoyed interacting with an app when it guided them through a process. In particular, Expedia does a great job with flight reservation: it consists of several steps, such as asking about the origin and destination airports and the dates. Once each step is done, the focus moves to the next question and the change is announced. Users are also able to exit this flow and get back to the search page to change or view other information. One of the interviewees was especially happy about the calendar, Figure
6(d), and mentioned “That was one of the coolest mobile calendars I’ve ever used because it walked me through where I was. I selected the start date, and it told me that, and then it said, pick your end date, and then it summarized with states, like September 19th Start date or September 20th in the trip.”
Alternative Suggestion. As we discussed before, several complex touch gestures, e.g., dragging or pinching, have no equivalent in TalkBack. Developers are therefore recommended to provide alternative interactions for complex gestures. For example, the calendar widget in Expedia, Figure
6(d), is designed to allow sighted users to modify their travel dates by dragging the start date to the end date. For TalkBack users, the app announces “Select dates again to modify”, which is an alternative way of modifying the dates.
Common Sense. During the interviews, we noticed that participants sometimes locate certain elements much faster than others. In particular, for elements like “Search” or “Back”, instead of using Linear navigation, they explored certain parts of the app with Touch navigation to locate the element. We asked how they locate these elements, and they generally responded that they do so with the help of common sense. For example, the back button or the navigation-drawer button is usually located at the top left of the screen, and menus are located in the footer. Common sense is not limited to the placement of familiar elements on the screen. In the interview for the Doordash app, the interviewee found the button that shows the address of a restaurant quite fast, even though the button was unlabeled. When we asked how he found such an element, he responded, “A normal company would put the address on top, you know. So I’m using it. That’s common sense.” Therefore, it is important for developers not to change the spatial arrangement of UI elements without considering users’ habits.