1 Introduction

The current work discusses the initial validation and preliminary results for ATHENA (Appraisal of Task Health and Effort through Non-Intrusive Assessments), a workload sensor that automatically assesses human workload. Once the workload level is known, performance can be optimized through adaptable automation [1] and task scheduling. Our machine-learning-enabled software sensor uses a variety of human behavioral features (such as linguistic analysis, keyboard dynamics, and computer vision), all obtained with zero intrusion and at little cost, since the underlying behaviors contributing to the metrics are naturally exhibited during task completion.

ATHENA is well suited to NASA’s expected long-duration space missions, as well as other high-criticality domains, because of its zero-intrusion nature and its use of a variety of metrics. By collecting naturally occurring behavioral metrics together with information obtained through non-contact sensors, ATHENA allows workload estimates to be obtained without modifying or interrupting the workflow. Such interruptions can affect the crew’s workload and confound results, as seen with self-reports or additional equipment attached to the operator [2]. The variety of metrics allows an appropriate subset to be applied as the current context permits. This is important because the wide variety and multimodal nature of tasks makes some metrics useful during some tasks but not others (e.g. keyboard dynamics are not useful if typing is not occurring). We have shown that a subset of our metrics can provide accurate classification, but the best classification rates are obtained when all available metrics are used [3].

2 Materials and Methods

2.1 Surveys

Ground truth workload data was obtained through surveys administered after the completion of each game. We used the Bedford Scale as a uni-dimensional rating scale to measure spare mental capacity [4]. The hierarchical scale guides users through a ten-point decision tree, with each point having an accompanying descriptor of the workload level. For classification purposes we divided the Bedford ratings into four levels following natural divisions provided by the scale itself: 1–3, 4–6, 7–9, and 10. We also used the NASA-TLX as a multi-dimensional rating scale to provide additional diagnostic information about experienced workload [5]. We divided the TLX scores into three levels such that 33 % of the data fell into each of a high, medium, and low category. This provided the discrete categories needed for classification while preserving the nature of the TLX as a measure of relative workload.
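The two discretization steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the tie-breaking rule in the tercile split is an assumption, since the text does not say how ties or sample sizes not divisible by three were handled.

```python
def bedford_level(rating: int) -> int:
    """Map a Bedford rating (1-10) to one of the four levels named in the text."""
    if not 1 <= rating <= 10:
        raise ValueError("Bedford ratings run from 1 to 10")
    if rating <= 3:
        return 0  # ratings 1-3
    if rating <= 6:
        return 1  # ratings 4-6
    if rating <= 9:
        return 2  # ratings 7-9
    return 3      # rating 10


def tercile_labels(scores):
    """Label each score low/medium/high so each label covers about a third of the data.

    Assumption: ties are broken by original order (the sort is stable).
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        # Ranks in the first third map to "low", the second to "medium", the rest to "high".
        labels[idx] = ("low", "medium", "high")[min(3 * rank // n, 2)]
    return labels
```

Because the tercile boundaries are computed from the collected scores themselves, the resulting labels are relative to this sample rather than absolute workload levels, which matches the stated intent of using the TLX for relative comparisons.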

2.2 Procedure

We developed a NASA-relevant testbed scenario by reframing the codebreaking game Mastermind [6] as a task in which astronauts perform a wiring reconfiguration to support the docking of future commercial crew and cargo vehicles. Nine professionals in the Minneapolis, MN area played the game. Linguistic data was collected through “mission control” texts using a pre-defined protocol, allowing for the collection of structured text and keyboard dynamics data, as well as unstructured linguistic data via “think aloud” [7]. We believe workload is a multi-dimensional concept in which different types of workload can be manipulated independently. Therefore, workload was manipulated ‘cognitively’ by using different in-game feedback mechanisms (that made the task either harder or easier) and by requiring participants to remember previous guesses, and ‘temporally’ by adding various time constraints. The effect of consistent background audio noise recorded onboard the ISS was also explored (see Fig. 1).

Fig. 1.
ATHENA pilot test conditions. Condition 1: baseline for all comparisons. Additional conditions: mental workload (2 and 3), noise (4), and temporal workload (5 and 6).

We developed proprietary software to collect and analyze keyboard and mouse dynamics, and extended techniques available in open-source software to derive heart rate from an RGB video stream. The collected data was processed to obtain the desired metrics, such as heart rate [8], typing pauses and errors [9], and task performance [10]. Metrics were chosen based on a literature review and internal brainstorming.
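As an illustration of what keyboard-dynamics metrics such as typing pauses and errors might look like in practice, the sketch below derives a few from timestamped key events. Everything here is an assumption made for illustration: the event format, the 2-second pause threshold, and the use of Backspace presses as a proxy for typing errors are hypothetical and are not taken from the proprietary software described above.

```python
def typing_metrics(events, pause_threshold=2.0):
    """Summarize keyboard dynamics from chronological (timestamp_seconds, key_name) pairs.

    Assumptions (hypothetical, not from the paper): a "pause" is any gap
    between consecutive keystrokes of at least `pause_threshold` seconds,
    and Backspace presses stand in for typing errors.
    """
    times = [t for t, _ in events]
    # Time elapsed between each pair of consecutive keystrokes.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]
    corrections = sum(1 for _, key in events if key == "Backspace")
    return {
        "keystrokes": len(events),
        "pause_count": len(pauses),
        "longest_gap": max(gaps) if gaps else 0.0,
        "correction_rate": corrections / len(events) if events else 0.0,
    }
```

Features of this kind require no extra action from the operator, which is the sense in which such metrics are zero-intrusion: they are a by-product of typing that already occurs during the task.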

3 Results

We used Simple Linear Regression as our supervised machine learning approach, via an interface with the WEKA toolkit [11], to classify each game played. Each participant played six games, and each game was divided into thirds for analysis. We performed 10-fold cross-validation using the survey scores as classification targets. We expected the total TLX to best classify all games, the Bedford and TLX mental subscale to best classify our Mental Low/Baseline/High conditions, and the TLX temporal subscale to best classify our Baseline/25/45 time limits. Noise was included as an exploratory variable.
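To make the evaluation setup concrete, the sketch below shows how a 10-fold split over the game segments could be generated. It is a plain-Python illustration, not the WEKA procedure the authors used. The segment count assumes every third of every game from all nine participants was retained, which the text does not state explicitly; the function name, the seed, and the unstratified shuffling are likewise assumptions.

```python
import random

# Assumption: 9 participants x 6 games x 3 segments per game, with no data dropped.
N_SEGMENTS = 9 * 6 * 3


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Plain, unstratified splitting: indices are shuffled once, dealt into k
    folds, and each fold serves as the held-out test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With a split of this kind, segments from the same participant or the same game can appear in both the training and test folds. Whether the original analysis grouped segments by participant or game is not stated in the text.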

Under these assumptions our classification accuracies ranged from 57 % to 100 % (see Fig. 2). The lower bound was 48 % accuracy when looking at a full cross between game type and the survey result used for classification, but 75 % when removing the all-games assumption. The TLX mental subscale (78 %), TLX total (100 %), and Bedford (79 %) had the highest classification accuracies for our Mental Low/Baseline/High conditions. The TLX temporal subscale (75 %) had the highest classification accuracy for our Baseline/25/45 time limits.

Fig. 2.
Classification accuracies, with larger shapes indicating greater overall accuracy.

To determine whether the other TLX subscales could be of additional diagnostic value, we completed classifications using each TLX subscale (see Fig. 3). The TLX mental (78 %), frustration (100 %), and performance (88 %) subscales had the highest classification accuracies for our Mental Low/Baseline/High conditions. The TLX temporal subscale had the highest classification accuracy for our Baseline/25/45 time limits. The TLX effort (85 %) and physical (68 %) subscales had the highest classification accuracies for our Baseline/Noise conditions.

Fig. 3.
Classification accuracies of TLX subscales, with larger shapes indicating greater overall accuracy.

4 Summary and Conclusions

Our preliminary results reflect the existence of different types of workload, with our zero-intrusion metrics demonstrating respectable classification accuracies when the variable causing workload (e.g. time) is matched with the type of workload assessed (e.g. temporal). For example, the TLX temporal subscale best classifies manipulations due to temporal demands but performs the worst for mental manipulations, as would be expected. The similar classification accuracies for the Bedford and the TLX mental subscale support the Bedford as a cognitive workload scale. The higher classification accuracy seen with the TLX total supports the idea that workload is more complex than purely cognitive, and suggests that our ‘cognitive’ manipulations were not purely cognitive in terms of workload dimensions. Finally, our results indicate that noise affected the perceived effort and physical aspects of workload, while our cognitive manipulations affected the perceived mental, frustration, and performance aspects of workload. Using our zero-intrusion metrics, no survey or subscale produced the highest accuracy for all our conditions combined (i.e. all games), even though there were marked differences in subjective reporting (see Fig. 4). While our work points to the existence of different workload types, more work is needed to fully understand them and the multi-dimensional concept that is workload.

Fig. 4.
Survey results for different game conditions.

Overall, ATHENA has demonstrated that accurate assessments of workload can be achieved by a sensor that relies solely on zero-intrusion metrics. ATHENA thus represents a valuable step toward automated workload support tools for long-duration space missions, and also serves as a tool for understanding the workload concept.