DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

Sheng Chen, Akshay Soni, Aasish Pappu, Yashar Mehdad

Abstract

In this work, we refer to the task of tagging news articles or blog posts with relevant tags from a predefined collection as document tagging. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. We propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of a provided feature vector, and in addition offers advantages such as learning tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.
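The prediction step described in the abstract, a k-nearest neighbor search in the joint vector space, can be illustrated with a minimal sketch. It assumes that prediction means ranking tag embeddings by cosine similarity to the embedding of the unseen document; the abstract does not specify the distance measure or how the document embedding is obtained. The function name `predict_tags`, the tag names, and the two-dimensional toy vectors are hypothetical and do not come from the paper.

```python
import numpy as np


def predict_tags(doc_vec, tag_vecs, tag_names, k=2):
    """Return the k tags whose embeddings are closest to doc_vec.

    Closeness is cosine similarity, which is an assumption of this sketch.
    """
    # Normalize so that a dot product equals cosine similarity.
    doc = doc_vec / np.linalg.norm(doc_vec)
    tags = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
    sims = tags @ doc
    # Indices of the k most similar tags, best first.
    top = np.argsort(-sims)[:k]
    return [(tag_names[i], float(sims[i])) for i in top]


# Hypothetical toy embeddings standing in for learned tag vectors.
tag_names = ["sports", "politics", "technology"]
tag_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

# Hypothetical embedding of an unseen document in the same space.
doc_vec = np.array([0.9, 0.3])

top_tags = predict_tags(doc_vec, tag_vecs, tag_names, k=2)
print(top_tags)
```

Because tags are ordinary vectors in the shared space, a newly created tag could in principle be supported by appending a row to `tag_vecs` once its embedding is available, without changing the search itself.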

Anthology ID: W17-2614
Volume: Proceedings of the 2nd Workshop on Representation Learning for NLP
Month: August
Year: 2017
Address: Vancouver, Canada
Editors: Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, Scott Yih
Venue: RepL4NLP
SIG: SIGREP
Publisher: Association for Computational Linguistics
Pages: 111–120
URL: https://aclanthology.org/W17-2614
DOI: 10.18653/v1/W17-2614
Cite (ACL): Sheng Chen, Akshay Soni, Aasish Pappu, and Yashar Mehdad. 2017. DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 111–120, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal): DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging (Chen et al., RepL4NLP 2017)
PDF: https://aclanthology.org/W17-2614.pdf
