MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Abstract

This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large number of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
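
The abstract does not spell out how the interpolation is computed, so the following is only a minimal sketch of the two ideas it names, under stated assumptions: that "interpolating in hidden space" means a mixup-style convex combination of two hidden vectors (with the labels mixed by the same weight), and that "low-entropy labels" are obtained by sharpening a predicted distribution with a temperature. The helper names `mix`, `sample_lambda` and `sharpen`, the toy vectors, and the choice of a Beta distribution for the weight are illustrative assumptions, not taken from the released code.

```python
import random


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def sample_lambda(alpha, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha), folded into [0.5, 1].

    Assumption: folding keeps the mixed sample closer to the first input.
    """
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def sharpen(probs, temperature):
    """Lower the entropy of a probability vector (assumed sharpening step)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy stand-ins for one encoder layer's hidden states and one-hot labels.
h_a, y_a = [1.0, 0.0, 2.0], [1.0, 0.0]
h_b, y_b = [0.0, 4.0, 2.0], [0.0, 1.0]

lam = 0.75                    # fixed here for clarity; sample_lambda() would vary it
h_mix = mix(h_a, h_b, lam)    # interpolated hidden state
y_mix = mix(y_a, y_b, lam)    # label interpolated with the same weight

# A guessed label for an unlabeled example, pushed towards lower entropy.
guess = sharpen([0.6, 0.4], temperature=0.5)
```

In a real model the mixed hidden state would be passed through the remaining encoder layers and the classifier, and the mixed label would serve as the training target; the sketch stops before that step because the abstract gives no further detail.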

Anthology ID: 2020.acl-main.194
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2147–2157
URL: https://aclanthology.org/2020.acl-main.194/
DOI: 10.18653/v1/2020.acl-main.194
Cite (ACL): Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online. Association for Computational Linguistics.
Cite (Informal): MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (Chen et al., ACL 2020)
PDF: https://aclanthology.org/2020.acl-main.194.pdf
Video: http://slideslive.com/38929239
Code: GT-SALT/MixText
Data: AG News, IMDb Movie Reviews
