Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification

Authors

  • Yuxue Hu, College of Informatics, Huazhong Agricultural University, Wuhan, China; Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China
  • Junsong Li, College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Mingmin Wu, College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Zhongqiang Huang, College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Gang Chen, Jointown Healthcare Technology Group, Wuhan, China
  • Ying Sha, College of Informatics, Huazhong Agricultural University, Wuhan, China; Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29786

Keywords:

NLP: Applications; ML: Transfer, Domain Adaptation, Multi-Task Learning; NLP: Lexical Semantics and Morphology; NLP: Sentence-level Semantics, Textual Inference, etc.

Abstract

Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” for illicit transactions. Thus, euphemism identification, i.e., mapping a given euphemism (“weed”) to its specific target word (“marijuana”), is essential for improving content moderation and combating underground markets. Existing methods employ self-supervised schemes to automatically construct labeled training datasets for euphemism identification. However, they overlook the text-text domain gap caused by the discrepancy between the constructed training data and the test data, which degrades performance. In this paper, we present the text-text domain gap and explain how it forms in terms of the data distribution and the cone effect. Moreover, to bridge this gap, we introduce a feature alignment network (FA-Net), which aligns both in-domain and cross-domain features, thus mitigating the domain gap between training data and test data and improving the performance of the base models for euphemism identification. Applying FA-Net to the base models yields markedly better results and a state-of-the-art model that outperforms large language models.

Published

2024-03-24

How to Cite

Hu, Y., Li, J., Wu, M., Huang, Z., Chen, G., & Sha, Y. (2024). Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18270-18278. https://doi.org/10.1609/aaai.v38i16.29786

Section

AAAI Technical Track on Natural Language Processing I