CrowdTC: Crowd-powered Learning for Text Classification

Published: 20 July 2021

Abstract

Text classification is a fundamental task in content analysis. Deep learning has recently demonstrated promising performance in text classification compared with shallow models. However, almost no existing models take advantage of human knowledge to help classify text. Humans are better than machine learning models at understanding and capturing the implicit semantic information in text. In this article, we use human guidance to classify text. We propose Crowd-powered learning for Text Classification (CrowdTC for short). We design questions and post them on a crowdsourcing platform to extract keywords from text. Sampling and clustering techniques reduce the cost of crowdsourcing. We also present an attention-based neural network and a hybrid neural network that incorporate the extracted keywords as human guidance into deep neural networks. Extensive experiments on public datasets confirm that the crowd-powered keyword guidance in CrowdTC improves the text classification accuracy of neural networks.
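To make the idea of keyword guidance concrete, the following minimal sketch shows one way crowd-extracted keywords could bias an attention layer. It is an illustration under stated assumptions, not the CrowdTC architecture, which the abstract does not specify: the function names `keyword_guided_attention` and `pool`, the additive `boost` hyperparameter, and the toy inputs are all hypothetical.

```python
import math


def keyword_guided_attention(tokens, logits, keywords, boost=1.0):
    """Return softmax attention weights with a bonus on crowd keywords.

    tokens   : list of str, the tokenized text
    logits   : list of float, unnormalized attention scores, one per token
    keywords : set of lowercase str, keywords supplied by crowd workers
               (hypothetical input format)
    boost    : float added to the logit of every keyword token
               (assumed hyperparameter, not taken from the paper)
    """
    if not tokens or len(tokens) != len(logits):
        raise ValueError("tokens and logits must be non-empty and equal in length")
    # Add the bonus only where the token matches a crowd keyword.
    biased = [
        score + (boost if token.lower() in keywords else 0.0)
        for token, score in zip(tokens, logits)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


def pool(vectors, weights):
    """Attention-weighted sum of token vectors (lists of floats)."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


if __name__ == "__main__":
    tokens = ["the", "battery", "dies", "quickly"]
    weights = keyword_guided_attention(tokens, [0.0] * 4, {"battery"}, boost=1.0)
    for token, weight in zip(tokens, weights):
        print(f"{token:>8s}  {weight:.3f}")
```

With uniform base scores, the keyword token receives a larger share of the attention mass than the other tokens, so the pooled text representation leans toward the words that humans marked as important. A learned model would produce the base scores itself; the additive bonus is only one of several plausible ways to inject the guidance.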

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 1
February 2022
475 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3472794
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2021
Accepted: 01 March 2021
Revised: 01 March 2021
Received: 01 November 2020
Published in TKDD Volume 16, Issue 1

Author Tags

  1. Text classification
  2. crowdsourcing
  3. keyword extraction
  4. neural networks

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • NSFC

Article Metrics

  • Downloads (Last 12 months): 17
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 02 Feb 2025

Cited By

  • (2024) Challenges and Opportunities of Text-Based Emotion Detection: A Survey. IEEE Access 12, 18416-18450. DOI: 10.1109/ACCESS.2024.3356357. Online publication date: 2024
  • (2023) Crowdsourcing Truth Inference via Reliability-Driven Multi-View Graph Embedding. ACM Transactions on Knowledge Discovery from Data 17:5, 1-26. DOI: 10.1145/3565576. Online publication date: 27-Feb-2023
  • (2023) Robust Multimodal Sentiment Analysis via Tag Encoding of Uncertain Missing Modalities. IEEE Transactions on Multimedia 25, 6301-6314. DOI: 10.1109/TMM.2022.3207572. Online publication date: 1-Jan-2023
  • (2023) Global triangle estimation based on first edge sampling in large graph streams. The Journal of Supercomputing 79:13, 14079-14116. DOI: 10.1007/s11227-023-05205-3. Online publication date: 3-Apr-2023
  • (2022) TIRA: Truth Inference via Reliability Aggregation on Object-Source Graph. IEEE Transactions on Knowledge and Data Engineering 35:11, 11967-11981. DOI: 10.1109/TKDE.2022.3225308. Online publication date: 29-Nov-2022
  • (2022) Recruitment Fraud Detection Method Based on Crowdsourcing and Multi-feature Fusion. 2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD), 267-273. DOI: 10.1109/ICAIBD55127.2022.9820055. Online publication date: 27-May-2022
