Covid19-twitter: A Twitter-based Dataset for Discourse Analysis in Sentence-level Sentiment Classification

Authors: Shashank Gupta, Mohamed Reda Bouadjenek, Antonio Robles-Kelly, Thanh Thi Nguyen, Dhananjay Thiruvady

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

Pages 5370 - 5374

https://doi.org/10.1145/3627673.3679120

Published: 21 October 2024

Abstract

For the sentence-level sentiment classification task, learning Contrastive Discourse Relations (CDRs) like a-but-b is difficult for Deep Neural Networks (DNNs) via purely data-driven training. Several methods exist in the literature for dissemination of CDR information with DNNs, but there is no dedicated dataset available to effectively test their dissemination performance. In this paper, we propose a new large-scale dataset for this purpose called Covid19-twitter, which contains around 100k tweets symmetrically divided into various categories. Instead of manual annotation, we used a combination of an Emoji analysis and a lexicon-based tool called Valence Aware Dictionary and sEntiment Reasoner (VADER) to perform automatic labelling of the tweets, while also ensuring high accuracy of the annotation process through some quality checks. We also provide benchmark performances of several baselines on our dataset for both the sentiment classification and CDR dissemination tasks. We believe that this dataset will be valuable for discourse analysis research in sentiment classification.
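As an illustration of the automatic labelling idea described in the abstract, the following is a minimal sketch, not the authors' pipeline. It assumes a tweet is kept only when an emoji signal and a lexicon signal agree, which is one plausible way to combine the two; the abstract does not specify the actual rule. The emoji sets, the toy lexicon (a stand-in for VADER with made-up valence scores), the threshold, and all function names are hypothetical. The a-but-b split shows how a Contrastive Discourse Relation could be detected, using the common convention from the literature that the sentiment of the sentence follows clause b.

```python
import re

# Hypothetical emoji polarity sets (assumption, not the paper's lists).
POSITIVE_EMOJI = {"😀", "😊", "👍"}
NEGATIVE_EMOJI = {"😢", "😡", "👎"}

# Toy valence lexicon standing in for VADER; the scores are made up.
TOY_LEXICON = {
    "good": 1.9, "great": 3.1, "love": 3.2, "safe": 1.8,
    "bad": -2.5, "terrible": -2.1, "sad": -2.1, "scary": -2.2,
}


def lexicon_label(text, threshold=0.5):
    """Sum word valences; return a polarity, or None if the score is weak."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return None


def emoji_label(text):
    """Return a polarity only when the emojis point one way."""
    has_pos = any(e in text for e in POSITIVE_EMOJI)
    has_neg = any(e in text for e in NEGATIVE_EMOJI)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None


def auto_label(text):
    """Keep a label only when both weak signals agree (assumed rule)."""
    from_emoji = emoji_label(text)
    from_lexicon = lexicon_label(text)
    if from_emoji is not None and from_emoji == from_lexicon:
        return from_emoji
    return None


def split_a_but_b(text):
    """Split 'a but b' into (a, b); return None if there is no such contrast."""
    parts = re.split(r"\bbut\b", text, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2 and parts[0].strip() and parts[1].strip():
        return parts[0].strip(), parts[1].strip()
    return None


tweet = "The lockdown is scary but I love working from home 😊"
clause_a, clause_b = split_a_but_b(tweet)
print(auto_label(tweet))        # positive
print(lexicon_label(clause_a))  # negative
print(lexicon_label(clause_b))  # positive
```

In this example the tweet-level label matches the polarity of clause b and contradicts clause a, which is the kind of case a CDR-aware classifier is expected to handle. A tweet whose emoji and lexicon signals disagree is discarded rather than labelled.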

Index Terms

Covid19-twitter: A Twitter-based Dataset for Discourse Analysis in Sentence-level Sentiment Classification
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Discourse, dialogue and pragmatics
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

October 2024

5705 pages

ISBN: 9798400704369

DOI: 10.1145/3627673

General Chairs: Edoardo Serra (Boise State University, USA), Francesca Spezzano (Boise State University, USA)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

CIKM '24

Sponsor:

SIGIR

CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise, ID, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
