Should You Fine-Tune BERT for Automated Essay Scoring?

Abstract

Most natural language processing research now recommends large Transformer-based models with fine-tuning for supervised classification tasks; older strategies like bag-of-words features and linear models have fallen out of favor. Here we investigate whether, in automated essay scoring (AES) research, deep neural models are an appropriate technological choice. We find that fine-tuning BERT produces similar performance to classical models at significant additional cost. We argue that while state-of-the-art strategies do match existing best results, they come with opportunity costs in computational resources. We conclude with a review of promising areas for research on student essays where the unique characteristics of Transformers may provide benefits over classical methods to justify the costs.

Anthology ID: 2020.bea-1.15
Volume: Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Month: July
Year: 2020
Address: Seattle, WA, USA → Online
Editors: Jill Burstein, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Helen Yannakoudakis, Torsten Zesch
Venue: BEA
SIG: SIGEDU
Publisher: Association for Computational Linguistics
Pages: 151–162
URL: https://aclanthology.org/2020.bea-1.15
DOI: 10.18653/v1/2020.bea-1.15
Cite (ACL): Elijah Mayfield and Alan W Black. 2020. Should You Fine-Tune BERT for Automated Essay Scoring?. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, Seattle, WA, USA → Online. Association for Computational Linguistics.
Cite (Informal): Should You Fine-Tune BERT for Automated Essay Scoring? (Mayfield & Black, BEA 2020)
PDF: https://aclanthology.org/2020.bea-1.15.pdf
