
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems

Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, Jason Eisner, Yu Su


Abstract
In natural language understanding (NLU) production systems, users’ evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation into this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on a small set of new symbols often decreases. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues.
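
To make the dilution effect concrete, here is a minimal sketch (not the authors' implementation, which the abstract does not specify). It assumes training examples are dicts with hypothetical "utterance" and "target" keys, measures how strongly a lexical cue is associated with a new symbol, and selectively drops cue-bearing examples that map to other symbols until the association reaches a chosen threshold:

```python
import random

def dilution_ratio(examples, cue, new_symbol):
    """Fraction of cue-bearing utterances whose target uses new_symbol.
    As unrelated training data grows, this ratio shrinks: the lexical
    cue gets 'diluted' among competing symbols."""
    with_cue = [ex for ex in examples if cue in ex["utterance"]]
    if not with_cue:
        return 0.0
    return sum(new_symbol in ex["target"] for ex in with_cue) / len(with_cue)

def drop_to_prevent_dilution(examples, cue, new_symbol,
                             target_ratio=0.5, seed=0):
    """Randomly drop cue-bearing examples that do NOT use new_symbol
    until the cue -> new_symbol association is at least target_ratio.
    (Hypothetical selection criterion, for illustration only.)"""
    rng = random.Random(seed)
    # Always keep examples without the cue, and cue-bearing positives.
    keep = [ex for ex in examples
            if cue not in ex["utterance"] or new_symbol in ex["target"]]
    diluting = [ex for ex in examples
                if cue in ex["utterance"] and new_symbol not in ex["target"]]
    n_pos = sum(1 for ex in keep if cue in ex["utterance"])
    # Solve n_pos / (n_pos + n_dil) >= target_ratio for n_dil.
    max_dil = int(n_pos * (1 - target_ratio) / target_ratio)
    rng.shuffle(diluting)
    return keep + diluting[:max_dil]
```

Under this sketch, a model trained on the filtered set sees the cue co-occur with the new symbol at a controlled rate, approximating the paper's finding that preventing dilution often reverses the downward trend on new symbols.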
Anthology ID:
2022.emnlp-main.789
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11473–11487
URL:
https://aclanthology.org/2022.emnlp-main.789
DOI:
10.18653/v1/2022.emnlp-main.789
Cite (ACL):
Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, Jason Eisner, and Yu Su. 2022. When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11473–11487, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems (Stengel-Eskin et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.789.pdf