4.1 Preparatory Design Work
Ideating. We started by ideating 10 different design dimensions to manipulate user sense of agency over time spent, drawing upon attention capture dark patterns that have been previously proposed [44, 51, 52, 70]. For example, Zagal et al. [70] introduce “playing by appointment,” wherein users are required to return to a game within a fixed amount of time or else lose a reward. This led us to ideate the Time Pressure dimension, which ranged on a spectrum from no control to full control. Another dimension, Content Selection, varied from maximum to minimum temptation level. We then translated each of these dimensions into 23 sets of three concrete feature ideas that ranged along this spectrum in terms of how much support they offered for user sense of agency (some dimensions inspired multiple feature sets). For example, for the Time Pressure dimension, we imagined video recommendations that expired if not watched within 30 minutes (low sense of agency), ones that expired within a day (medium sense of agency), and ones that were always available (high sense of agency). For Content Selection, we imagined a search algorithm that was tweaked to show results with a maximum entertainment level regardless of the user’s actual query (low sense of agency), one version that showed both entertaining and relevant results (medium sense of agency), and one version with only relevant results (high sense of agency). As a group, we then scored these feature sets in terms of expected impact, novelty, and technical feasibility.
Prototyping. Paper mockups for the seven highest-scoring feature sets were evaluated in 13 co-design sessions with YouTube users, as described in our prior work [44]. For example, Figure 5 shows the prototype for Content Selection with three different versions of search results. We initially anticipated building three different versions of SwitchTube along a spectrum of support for sense of agency (low, medium, high) to find a “Goldilocks” level of control, as has been suggested in prior work on lockout mechanisms [32, 46]. However, co-design sessions with YouTube users revealed that rather than having a stable preference at all times, users wanted different levels of control in different situations. For example, when they had a specific intention in mind, they preferred a search-first interface, whereas when they just wanted to relax or pass the time, they preferred a recommendations-first interface [44]. Taking these findings into account, we instead designed two versions that support different levels of sense of agency, and a third version in which users could switch between the two (hence the name SwitchTube). This also prompted us to consider that Switch (rather than Focus) might offer the highest sense of agency of all of the versions, a hypothesis that we test in this work.
We created an interactive mockup of our complete SwitchTube design in Figma and conducted usability testing with four participants, all university students who were active YouTube users. Participants completed four tasks, which helped us identify a number of smaller usability issues. One of the usability testing participants said they would like to use the low-agency version to “explore viral content” and the high-agency version to “focus on my goal,” which led us to call the two versions “Explore” and “Focus.” While labeling the two versions in this way could lead study participants to form preconceived notions of how to use that version (as opposed to, say, “Version A” and “Version B”), we decided that this was worthwhile because it would make it easy for participants to recall the two versions in the exit survey and interview, and it seemed unlikely to signal to participants that they should necessarily prefer one version over the other.
Table 5 shows an overview of the three different versions of the final SwitchTube study app and their features. A screenshot of the homepage of each of the three versions is shown in Figure 1. A short video introduction and captioned screenshots of the entire app are also available on the Open Science Framework: https://osf.io/z735n. We refer to the three versions of the app as the Explore Version, Focus Version, and Switch Version, and the two toggle options within the Switch Version as Explore Mode and Focus Mode.
Our aim for Focus was to support the user’s specific intention for visiting the app if they had one (e.g., learning how to cook a turkey), whereas Explore was designed to maximize distractions that would take them away from their original intention. In doing so, we expected that sense of agency (the user’s experience of being the initiator of their actions) would be supported in Focus and diminished in Explore. To this end, we appended “viral” to every search query submitted in the Explore Version. Our goal was not to fill Explore with viral content per se, but rather to add noise and temptations to the user’s search results. To simply add noise, we could have appended any term to the user’s search query (e.g., “zebras”), but our internal testing of several different terms (e.g., “entertaining,” “funny,” and “creative”) suggested that “viral” was the most effective at returning results that were also tempting.
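The query manipulation above can be sketched in a few lines. This is a minimal illustration of the idea (not the app's actual code, which ran on Android); the function name and version labels are ours:

```python
def build_search_query(user_query: str, version: str) -> str:
    # Hypothetical sketch: in the Explore Version, "viral" is appended
    # to every query to add tempting noise to the search results; the
    # Focus Version passes the user's query through unchanged.
    if version == "explore":
        return f"{user_query} viral"
    return user_query
```

For example, a Focus search for "how to cook a turkey" is sent as-is, while the same query in Explore is sent as "how to cook a turkey viral".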
Homepage video recommendations in SwitchTube were not personalized due to restrictions of the YouTube Data API (personal watch history and recommendations are difficult to access due to understandable privacy concerns). Instead, homepage recommendations were drawn from the most popular videos in different YouTube categories (e.g., music, comedy) for the U.S. region; these are the same non-personalized recommendations that are displayed in YouTube’s own categories. We further address the absence of personalized recommendations in the discussion section. In the Explore Version, the homepage featured an unlimited scroll of these videos, and the video player showed related video recommendations below the video that was currently playing and autoplayed the next related video. In the Focus Version, the homepage hid recommendations by default, and related videos and autoplay were removed by design.
Building. An illustrated software architecture model for the SwitchTube study app on Android is shown in Figure 6. The app assigned participants to experimental conditions and used a logger to monitor information about how participants used the app. The user interface had homepage video feeds, a video player, and search results. These were populated with data pulled from the YouTube Data API and the Google Custom Search API. Finally, the app conducted experience sampling, which we built as a custom system. All of this data was sent to the Firebase Realtime Database and then synced with Google BigQuery to allow for custom views and further analysis.
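To make the logging pipeline concrete, a usage-log event of the kind sent to the Firebase Realtime Database might look like the following. This is a hypothetical sketch only; the field names and event types are illustrative, not the study app's actual schema:

```python
import time

def make_log_event(participant_id: str, event_type: str, payload: dict) -> dict:
    # Illustrative event shape (field names are ours, not the actual
    # schema) for usage logs written to the Firebase Realtime Database
    # and later synced to BigQuery for analysis.
    return {
        "participant": participant_id,
        "event": event_type,  # e.g., "search", "video_play", "toggle"
        "timestamp_ms": int(time.time() * 1000),
        "payload": payload,
    }
```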
One particular challenge we encountered was a severely limited quota for the YouTube Data API, which we needed to populate the video recommendations on the homepage and video player (related videos and autoplay) and to return search results. When we built SwitchTube, YouTube restricted developers to a default quota of 10,000 units per day, whereas the default quota at the time of previous YouTube research had been 1 million [28]. As a result, we quickly maxed out our quota in testing the app (e.g., a single search has a quota cost of 100). We tried the official form for requesting an increased quota, but received no response. Drawing upon our privileged position, we contacted multiple personal connections at Google in managerial positions, who were also unable to get the YouTube Data API team to grant our request. In the end, we were forced to integrate a second API into SwitchTube (the Google Custom Search API), which we could pay for and use to populate search results, but it cost us considerable time and effort to do so. Two years later, after we had finished our deployment study, we received an email that YouTube had finally launched an official YouTube Researcher Program with expanded access to their Data API.4 We hope our report of this barrier lends support to the regulatory push to require large technology companies to provide researchers with greater access to audit and redesign their algorithmic systems for digital wellbeing.
Piloting. Our research team internally piloted the SwitchTube study app on a variety of Android devices over eight weeks. This again identified countless usability issues, from the font size of the experience sampling prompts to missing log data, which we resolved in the next version. We then recruited four students, all active YouTube users, from outside the research team for external piloting, which identified further issues with study procedures but also confirmed that the app was ready for deployment. We note that these participants identified several usability issues that we deliberately decided not to fix (e.g., when the phone was rotated horizontally, the video had to reload). Our goal was not to rebuild a user experience as seamless as YouTube itself (which would have required a Herculean effort), but rather to develop a proof-of-concept system acceptable enough that participants would engage with it sufficiently to address our research questions [22].
4.2 Pre-Registered Hypotheses
Our third research question asks how adaptable commitment interfaces influence user experience. In line with this question, we posed several specific hypotheses. Following the best practices of the open science movement, we pre-registered these before examining the data: https://osf.io/sevfd. This helped us think through our study protocols in advance and guard against the natural temptation of hypothesizing after the results are known (HARKing) [14]. As noted in the pre-registration, in addition to this confirmatory analysis with pre-registered hypotheses, we also planned to conduct exploratory analyses of the log data from the app, such as time spent in the different versions, but to use only descriptive statistics for this purpose.
In general, our pre-registered hypotheses tested whether the Switch Version (an adaptable commitment interface) provides the ‘best of both worlds’ across measures of sense of agency, satisfaction, and personal goal alignment. All of our hypotheses were tested by measuring the mean per-participant rating (1-7) of experience sampling method (ESM) responses for these metrics.
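The unit of analysis above, the mean per-participant rating, can be computed as follows. This is a minimal sketch assuming a hypothetical flat layout of the raw ESM data as (participant, version, rating) tuples:

```python
from statistics import mean

def per_participant_means(esm_responses):
    # esm_responses: list of (participant_id, version, rating) tuples,
    # a hypothetical layout for the raw ESM data. Returns the mean
    # rating per participant per app version, which is the unit of
    # analysis for the pairwise hypothesis tests.
    grouped = {}
    for pid, version, rating in esm_responses:
        grouped.setdefault((pid, version), []).append(rating)
    return {key: mean(vals) for key, vals in grouped.items()}
```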
H1: User Sense of Agency. Our first set of hypotheses (H1a-H1c) addressed user sense of agency, which prior work suggests is at the center of user concerns with social media [4]. Our expectation was: Switch > Focus > Explore, which corresponds to three pairwise comparisons:
• H1a: The mean rating will be higher for Focus than Explore.
• H1b: The mean rating will be higher for Switch than Explore.
• H1c: The mean rating will be higher for Switch than Focus.
The features in Focus and Explore were based on our prior research into how the features of YouTube affect user sense of agency [44]. As Switch lets users toggle between the Focus and Explore interfaces, we expected that this additional option would further increase user sense of agency.
H2: Satisfaction. Our second set of hypotheses (H2a-H2c) addressed user satisfaction, as in the short-term pleasure that users derive from social media apps. Our expectation was: Switch > Explore > Focus, again corresponding to three pairwise hypotheses that follow the same pattern as H1. In our previous study of YouTube [44], users reported that homepage recommendations often provided short-term satisfaction, but Focus hides these by default. We expected Switch might provide a useful option to avoid recommendations at times when they are not wanted.
H3: Goal Alignment. Our third set of hypotheses (H3a-H3c) addressed personal goal alignment, as in how well app use aligned with the user’s long-term goals for use. Our expectation was: Switch > Focus > Explore, which again implies three pairwise comparisons. This is because our previous work found that search often supports YouTube users’ personal goals [44], but Explore minimizes the search option and adds distracting temptations to the results. On occasions where recommendations might better support the user than search (e.g., when survey participants said their goal was to find new or diverse content to watch), Switch would also provide that option.
4.3 Methods
Recruitment. We screened the 606 participants from our survey for the following three inclusion criteria:
(1) Action or preparation stage of change with regards to their YouTube use (48% of survey participants met this criterion).
(2) Own an Android smartphone with operating system version 6.x (Marshmallow) or higher, as the study app did not support older versions (87% of survey participants met this criterion).
(3) Spend at least 10 minutes per day in the YouTube mobile app, according to self-estimate (75% of survey participants met this criterion). This was to ensure that participants already had a regular habit of watching videos on mobile, making it more natural for them to use SwitchTube.
This left us with 146 survey participants who were eligible to also become experiment participants. Given that a prospective power analysis for ESM studies requires an estimate of effect size that is difficult to obtain for a novel technology, we instead followed van Berkel et al.’s guidance and informed our target number of participants using local standards in the HCI community [5], where the median is 18 participants and the mean is 53 [10]. Since we wanted to be able to detect differences between conditions with a high degree of confidence using frequentist hypothesis testing, we set a target of 45 participants completing the field experiment.
We invited eligible survey participants to participate in small batches until we approached our target. In the invitation to the study and again upon installing the study app, participants were informed that the research team would monitor and analyze their activity in the study app, including their searches and the titles of the videos they watched.
Demographics. A total of 46 participants completed the experiment (see demographics in Table 6). We happened to oversample Asian, Black, and young people relative to the general U.S. population [64].
YouTube Use. Field experiment participants spent a median of 140 minutes per day (interquartile range: 120-240) on YouTube across all devices in the week prior to the survey (self-estimated). Of this time, participants estimated they spent a median of 63% (interquartile range: 40-84%) in the mobile app. We again multiplied time on all devices by the percentage spent on mobile, for each participant, to find that field experiment participants spent a median of 87 minutes per day in the YouTube mobile app. This is considerably higher than the median of 34 minutes per day that all survey participants spent in the app, indicating that those who were invited to and participated in the field experiment were heavier YouTube users.
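The per-participant estimate above is a simple element-wise product followed by a median. A minimal sketch, with made-up values for illustration:

```python
from statistics import median

def estimated_mobile_minutes(total_minutes, mobile_fraction):
    # Per-participant estimate: total daily YouTube minutes across all
    # devices multiplied by the self-estimated share spent on mobile.
    return [t * f for t, f in zip(total_minutes, mobile_fraction)]

# Illustrative values only, not the study data.
minutes = estimated_mobile_minutes([140, 200, 100], [0.63, 0.50, 0.90])
```

Taking the median of these per-participant values over the whole sample is how the reported 87-minute figure was derived.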
Procedures. As shown in Table 7, participants completed an entrance survey, one week of use of each of the three versions of the SwitchTube app (Explore, Focus, Switch), an exit survey, and, for a subset of participants, an exit interview. In the entrance survey, participants completed additional questions about the nature of their YouTube use and received instructions for installing the SwitchTube study app on their Android phone from the Google Play store.
Upon installing SwitchTube, participants were assigned to start in either Explore or Focus following a counterbalanced assignment. Although it risked introducing ordering effects, we decided against also counterbalancing the Switch condition. Instead, Switch always came last so that we could understand when and why participants chose to toggle between Explore and Focus (RQ2) after having experienced each for a week. In each week, participants were required to use the app on 3 or more days for a total of at least 30 minutes. If participants did not meet these requirements, they were disqualified from further participation, but still compensated for their participation to that point.
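The assignment scheme above can be sketched as follows. This is a hypothetical illustration of the counterbalancing logic (the paper does not specify how alternation was implemented), with Switch fixed in the final week:

```python
def version_order(participant_index: int) -> list:
    # Hypothetical counterbalancing: alternate whether a participant
    # starts in Explore or Focus; Switch always comes last so that
    # participants have experienced both other versions first.
    first_two = ["Explore", "Focus"] if participant_index % 2 == 0 else ["Focus", "Explore"]
    return first_two + ["Switch"]
```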
The SwitchTube app collected both objective and subjective data. Objective data included logs of time spent, searches made, videos watched, and the source of watched videos (e.g., homepage recommendations). In terms of subjective data, participants were experience sampled using the three questions in Table 8. Conceptually, we wanted to capture an understanding of how the different versions influenced sense of agency, as well as satisfaction in the sense of short-term pleasure and goal alignment in the sense of long-term personal goals. Unfortunately, we could not find validated scales that were short enough to be suitable for ESM, but we tested our wording for clarity in our piloting. This led us to clarify that we wanted participants to answer about the particular session of use (“For this SwitchTube use”) rather than their use of SwitchTube as a whole.
In terms of timing, a prompt appeared with these three questions on the participant’s phone when the following conditions were met:
(1) The participant had not already responded in the past hour;
(2) The participant had used the app for at least 30 seconds;
(3) The app went into the background (e.g., the user exited the app to the phone’s home screen or switched to another app).
If the participant did not respond within one minute, the prompt disappeared.
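The trigger conditions above amount to a simple conjunction. A minimal sketch of that logic (parameter names are ours; the actual check ran inside the Android app):

```python
def should_prompt(seconds_since_last_response, seconds_of_app_use, app_went_to_background):
    # ESM prompt fires only when all three study conditions hold.
    return (
        seconds_since_last_response >= 3600  # no response in the past hour
        and seconds_of_app_use >= 30         # at least 30 seconds of app use
        and app_went_to_background           # app just left the foreground
    )
```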
After completing one week each in Explore, Focus, and Switch, participants completed an exit survey. In the exit survey, participants were shown screenshots of the homepage, search results page, and video player in Explore and Focus as a reminder and answered which they preferred and why. Participants then explained when and why they switched between versions of the app. Finally, they answered which version of the app they preferred and why.
Exit interviews were conducted remotely over Zoom with a subset of participants using a method called data-driven retrospective interviewing [60]. Using screen share, participants were shown counts, tables, and visualizations from their own log data, e.g., time spent in the app, occasions when they switched between versions, and their ESM ratings, and asked questions intended to elicit the “why” behind their behaviors. For example, we asked:
In the Switch version, you switched between the Focus and the Explore interface 17 times. Can you look at the table below, choose a couple of examples, and describe why you switched at that time?
We also retrieved the original change that participants wanted to make to their YouTube use from the survey and asked them whether the different versions of SwitchTube supported that goal. A total of 16 participants were interviewed, at which point we believe we reached data saturation with regards to our research questions. Interviews lasted about 45 minutes each.
Participant incentives were backloaded to encourage participants to complete the entire study, allowing us to compare their experience between conditions. This meant: $5 for the entrance survey, $15 for week 1 of app use, $30 for week 2, $50 for week 3 and the exit survey, and $20 for the exit interview. To protect data privacy, we assigned each participant a unique identifier (e.g., 446565) that was associated with their usage data. We connected this data to the participant’s personally identifiable information (e.g., contact information) only for the exit interviews, where we presented participants with a personalized summary of their usage. This research was approved by the University of Washington IRB.