Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification
DOI:
https://doi.org/10.1609/aaai.v38i16.29786
Keywords:
NLP: Applications, ML: Transfer, Domain Adaptation, Multi-Task Learning, NLP: Lexical Semantics and Morphology, NLP: Sentence-level Semantics, Textual Inference, etc.
Abstract
Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” in illicit transactions. Thus, euphemism identification, i.e., mapping a given euphemism (“weed”) to its specific target word (“marijuana”), is essential for improving content moderation and combating underground markets. Existing methods employ self-supervised schemes to automatically construct labeled training datasets for euphemism identification. However, they overlook the text-text domain gap caused by the discrepancy between the constructed training data and the test data, which leads to performance deterioration. In this paper, we present the text-text domain gap and explain how it forms in terms of the data distribution and the cone effect. Moreover, to bridge this gap, we introduce a feature alignment network (FA-Net), which aligns both in-domain and cross-domain features, thus mitigating the domain gap between training data and test data and improving the performance of the base models for euphemism identification. We apply FA-Net to the base models, obtaining markedly better results and creating a state-of-the-art model that outperforms large language models.
Published
2024-03-24
How to Cite
Hu, Y., Li, J., Wu, M., Huang, Z., Chen, G., & Sha, Y. (2024). Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18270-18278. https://doi.org/10.1609/aaai.v38i16.29786
Issue
Section
AAAI Technical Track on Natural Language Processing I