Abstract
We describe preliminary evaluation data for ATHENA (Appraisal of Task Health and Effort through Non-intrusive Assessments), a no-contact, zero-intrusion workload measurement method that harnesses multimodal metrics (e.g., linguistic markers, keyboard dynamics, and computer vision). Preliminary results reflect the existence of different types of workload, with our zero-intrusion metrics demonstrating respectable classification accuracies when the variable causing workload (e.g., time) is matched with the type of workload assessed (e.g., temporal). By requiring no extra equipment and not interrupting workflow, ATHENA represents a valuable step forward in providing automated workload support tools as well as a tool for understanding the workload concept.
Keywords
- Human factors
- Zero-intrusion workload measure
- Linguistic analysis
- Cognitive workload
- Machine learning
- Keyboard dynamics
1 Introduction
The current work discusses the initial validation and preliminary results for ATHENA (Appraisal of Task Health and Effort through Non-Intrusive Assessments), a workload sensor able to automatically assess and evaluate human workload. Once workload level is known, performance can be optimized through adaptable automation [1] and task scheduling. Our machine-learning-enabled software sensor uses a variety of human behavioral features (such as linguistic analysis, keyboard dynamics, and computer vision), all obtained with zero intrusion and at little cost, since the underlying behaviors contributing to the metrics are naturally exhibited during task completion.
ATHENA is ideally suited for NASA’s expected long-duration space missions, as well as other high-criticality domains, due to its zero-intrusion nature and use of a variety of metrics. By collecting naturally occurring behavioral metrics as well as information obtained through no-contact sensors, ATHENA allows workload estimates to be obtained without modifying or interrupting workflow, which can affect the crew’s workload and confound results (as seen with self-reports or additional equipment attached to the operator [2]). The variety of metrics allows an appropriate subset to be applied as the current context allows. This is important given that the wide variety and multimodal nature of tasks make some metrics useful during some tasks but not others (e.g., keyboard dynamics are not useful if typing is not occurring). We have shown that a subset of our metrics can provide accurate classification, but the best classification rates are obtained when all available metrics are used [3].
2 Materials and Methods
2.1 Surveys
Ground truth workload data was obtained through surveys administered after the completion of each game. We used the Bedford Scale as a uni-dimensional rating scale to measure spare mental capacity [4]. The hierarchical scale guides users through a ten-point decision tree, with each point having an accompanying descriptor of the workload level. For classification purposes we divided the Bedford into four levels following the natural divisions provided by the scale itself: 1–3, 4–6, 7–9, and 10. We also used the NASA-TLX as a multi-dimensional rating scale to provide additional diagnostic information about experienced workload [5]. We divided the TLX into three levels such that 33 % of the data fell into each of a high, medium, and low category. This provided the discrete categories needed for classification while maintaining the nature of the TLX as a measure of relative workload levels.
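The two discretizations described above can be sketched as follows. This is a minimal illustration of the binning logic, not the paper's actual code; the function names and level labels are ours.

```python
import numpy as np

def bedford_level(score):
    """Map a Bedford rating (1-10) to one of four levels following
    the scale's natural divisions: 1-3, 4-6, 7-9, and 10."""
    if not 1 <= score <= 10:
        raise ValueError("Bedford ratings range from 1 to 10")
    if score <= 3:
        return "level1"
    if score <= 6:
        return "level2"
    if score <= 9:
        return "level3"
    return "level4"  # a rating of 10 forms its own level

def tlx_levels(scores):
    """Split a set of TLX scores into low/medium/high tertiles so
    that roughly 33% of the data falls into each category."""
    lo, hi = np.percentile(scores, [100 / 3, 200 / 3])
    return ["low" if s <= lo else "medium" if s <= hi else "high"
            for s in scores]
```

Note that the TLX split is data-relative (tertiles of the observed scores), whereas the Bedford split uses fixed boundaries from the scale itself.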
2.2 Procedure
We developed a NASA-relevant testbed scenario by reframing the codebreaking game of Mastermind [6] as a task for astronauts performing a wiring reconfiguration to support the docking of future commercial crew and cargo vehicles. Nine professionals in the Minneapolis, MN area played the game. Linguistic data was collected through “mission control” texts using a pre-defined protocol, allowing for the collection of structured text and keyboard dynamic data, as well as unstructured linguistics via “think aloud” [7]. We believe workload is a multi-dimensional concept for which different types of workload can be manipulated independently. Therefore, workload was manipulated ‘cognitively’ by using different in-game feedback mechanisms (that either made the task harder or easier) and requiring memory usage for previous guesses, and ‘temporally’ by adding various time constraints. The effect of a consistent audio background noise recorded from onboard the ISS was also explored, see Fig. 1.
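For readers unfamiliar with the testbed task: Mastermind's standard scoring rule gives the player an exact-match count and a wrong-position count after each guess. A sketch of that rule (our own illustration; the paper's reframed wiring task may present feedback differently):

```python
from collections import Counter

def mastermind_feedback(secret, guess):
    """Score a Mastermind guess: return (exact, partial), where
    'exact' counts correct symbols in the correct position and
    'partial' counts correct symbols in the wrong position."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Count symbol overlaps regardless of position, then subtract exact hits
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact
```

Varying how much of this feedback is shown, or requiring players to remember prior guesses, is one way feedback mechanisms can make the task harder or easier, as in the ‘cognitive’ manipulation described above.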
We developed proprietary software to collect and analyze keyboard and mouse dynamics, and augmented techniques available in open source software to derive heart rate using the RGB video stream. The collected data was processed to obtain desired metrics such as heart rate [8], typing pauses and errors [9], as well as task performance [10]. Metrics were chosen based on a literature review and internal brainstorming.
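The paper does not detail its heart-rate derivation, but open-source approaches of this kind typically estimate pulse from the mean green-channel intensity of facial video by locating the dominant spectral peak in a plausible pulse band. A minimal sketch under that assumption (band limits and parameter names are illustrative, not from the paper):

```python
import numpy as np

def estimate_heart_rate(green_means, fps, lo_bpm=45, hi_bpm=180):
    """Estimate heart rate (beats/min) from a 1-D series of mean
    green-channel values sampled at `fps` frames per second, by
    finding the dominant FFT peak within a plausible pulse band."""
    signal = np.asarray(green_means, dtype=float)
    signal = signal - signal.mean()                     # remove DC offset
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)   # bin frequencies, Hz
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs >= lo_bpm / 60.0) & (freqs <= hi_bpm / 60.0)
    peak_hz = freqs[band][np.argmax(power[band])]
    return peak_hz * 60.0                               # Hz -> beats/min
```

Real implementations add face tracking, detrending, and temporal smoothing; the point here is only the band-limited spectral-peak idea.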
3 Results
We used Simple Linear Regression as our supervised machine learning approach, via an interface with the WEKA toolkit [11], to classify each game played. Each participant played six games, each game divided into thirds for analysis. We performed 10-fold cross-validation using the survey scores as classification targets. We expected the total TLX to best classify all games, the Bedford and TLX mental subscale to best classify our Mental Low/Baseline/High conditions, and the TLX temporal subscale to best classify our Baseline/25/45 time limits. Noise was included as an exploratory variable.
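The classification itself ran through WEKA; purely to illustrate the k-fold evaluation protocol, the procedure can be sketched in plain Python with a stand-in 1-nearest-neighbour model (all names are ours, and this is not the paper's WEKA pipeline):

```python
import random

def k_fold_accuracy(samples, labels, k=10, seed=0):
    """Estimate classification accuracy with k-fold cross-validation,
    using a 1-nearest-neighbour classifier as a stand-in model."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        for i in test:
            # predict the label of the closest training sample
            nearest = min(train, key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(samples[i], samples[j])))
            correct += labels[nearest] == labels[i]
    return correct / len(samples)
```

Each fold is held out once while the model is fit on the remaining folds, so every game segment contributes to both training and testing exactly once.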
Under these assumptions our classification accuracies ranged from 57 %–100 % (the lower bound was 48 % accuracy when looking at a full cross between game type and the survey result used for classification, but 75 % when removing the all-games condition), see Fig. 2. The TLX mental subscale (78 %), TLX total (100 %), and Bedford (79 %) had the highest classification accuracies for our Mental Low/Baseline/High conditions. The TLX temporal subscale (75 %) had the highest classification accuracy for our Baseline/25/45 time limits.
To determine if the other TLX subscales could be of additional diagnostic value, we completed classifications using each TLX subscale, see Fig. 3. The TLX mental (78 %), frustration (100 %), and performance (88 %) subscales had the highest classification accuracies for our Mental Low/Baseline/High conditions. The TLX temporal subscale had the highest classification accuracies for our Baseline/25/45 time limits. The TLX effort (85 %) and physical (68 %) subscales had the highest classification accuracies for our Baseline/Noise conditions.
4 Summary and Conclusions
Our preliminary results reflect the existence of different types of workload, with our zero-intrusion metrics demonstrating respectable classification accuracies when the variable causing workload (e.g., time) is matched with the type of workload assessed (e.g., temporal). For example, the TLX temporal subscale best classifies manipulations due to temporal demands but performs the worst for mental manipulations, as would be expected. Also, the similar classification accuracies for the Bedford and TLX mental subscale support the Bedford as a cognitive workload scale, while the increased classification accuracy seen with the TLX total supports the idea that workload is more complex than purely cognitive in nature and that our ‘cognitive’ manipulations were not purely cognitive in terms of workload dimensions. Finally, our results indicate noise affected the perceived effort and physical aspects of workload, while our cognitive manipulations affected the perceived mental, frustration, and performance aspects of workload. Using our zero-intrusion metrics, no survey or subscale produced the highest accuracy levels for all our conditions combined (i.e., all games), even though there were marked differences in subjective reporting, see Fig. 4. While our work points to the existence of different workload types, more work is needed to fully understand them and the multi-dimensional concept that is workload.
Overall, ATHENA has demonstrated that accurate assessments of workload can be achieved by a sensor that solely utilizes zero-intrusion metrics. Thus, ATHENA represents a valuable step forward in providing automated workload support tools that can be used on long-duration space missions, as well as a tool for understanding the workload concept.
References
Miller, C.A., Funk, H., Goldman, R., Meisner, J., Wu, P.: Implications of adaptive vs. adaptable UIs on decision making: Why “automated adaptiveness” is not always the right answer. In: Proceedings of the 1st International Conference on Augmented Cognition, Las Vegas (2005)
Chen, F., Ruiz, N., Choi, E., Epps, J., Khawaja, M.A., Taib, R., Yin, B., Wang, Y.: Multimodal behavior and interaction as indicators of cognitive load. ACM Trans. Interact. Intell. Syst. (TiiS) 2(4), 22 (2012)
Wu, P., Ott, T., Paullada, A., Mayer, D., Gottlieb, J., Wall, P.: Inclusion of linguistic features to a zero-intrusion workload assessment technique. In: Proceedings of the 7th AHFE Conference, 27–31 July 2016. CRC Press, Inc. (accepted)
Roscoe, A.H.: Assessing pilot workload in flight. In: AGARD Conference Proceedings Flight Test Techniques, Paris (1984)
Hart, S.G., Staveland, L.E.: Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In: Hancock, P.A., Meshkati, N. (eds.) Human Mental Workload. North Holland Press, Amsterdam (1988)
Meirowitz, M.: Mastermind (1970)
van Someren, M.W., Barnard, Y.F., Sandberg, J.A.C.: The Think Aloud Method: A Practical Guide to Modelling Cognitive Processes. Academic Press, London (1994)
Miller, S.: Literature review workload measures. Document ID: N01-006. National Advanced Driving Simulator. http://www.nads-sc.uiowa.edu/publicationStorage/200501251347060.N01-006.pdf (2001). Accessed 7 Jan 2014
Vizer, L.M., Zhou, L., Sears, A.: Automated stress detection using keystroke and linguistic features: An exploratory study. Int. J. Hum Comput Stud. 67(10), 870–886 (2009)
Tsang, P.S., Vidulich, M. A.: Mental workload and situation awareness. In: Handbook of Human Factors and Ergonomics (2006)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 37–57 (2009)
Acknowledgments
Concepts described above were developed with support from US AF (Contract # FA8650-06-C-6635), NIST (Contract # 70NANB0H3020), ONR (Contract # N00014-09-C-0265) and NASA (Contract # NNX12AB40G). ATHENA was sponsored by NASA SBIR (Contract # NNX15CJ18P), undertaken by SIFT, LLC. We would like to thank Mai Lee Chang, Kristina Holden, Brian Gore, Gordon Voss, Aniko Sandor, Alexandra Whitmire, and Mihriban Whitmore for oversight, guidance, and support.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Ott, T., Wu, P., Paullada, A., Mayer, D., Gottlieb, J., Wall, P. (2016). ATHENA – A Zero-Intrusion No Contact Method for Workload Detection Using Linguistics, Keyboard Dynamics, and Computer Vision. In: Stephanidis, C. (eds) HCI International 2016 – Posters' Extended Abstracts. HCI 2016. Communications in Computer and Information Science, vol 617. Springer, Cham. https://doi.org/10.1007/978-3-319-40548-3_38
Print ISBN: 978-3-319-40547-6
Online ISBN: 978-3-319-40548-3