DOI: 10.1145/3627673.3679120
short-paper
Open access

Covid19-twitter: A Twitter-based Dataset for Discourse Analysis in Sentence-level Sentiment Classification

Published: 21 October 2024

Abstract

In sentence-level sentiment classification, Deep Neural Networks (DNNs) struggle to learn Contrastive Discourse Relations (CDRs) such as a-but-b through purely data-driven training. Several methods exist in the literature for disseminating CDR information in DNNs, but no dedicated dataset is available to test their dissemination performance effectively. In this paper, we propose a new large-scale dataset for this purpose, Covid19-twitter, which contains around 100k tweets symmetrically divided into various categories. Instead of manual annotation, we labelled the tweets automatically by combining emoji analysis with a lexicon-based tool, the Valence Aware Dictionary and sEntiment Reasoner (VADER), and applied quality checks to ensure high annotation accuracy. We also provide benchmark performance of several baselines on our dataset for both the sentiment classification and CDR dissemination tasks. We believe this dataset will be valuable for discourse analysis research in sentiment classification.
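
The automatic labelling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the emoji table, the toy lexicon scorer (a stand-in for VADER), the agreement rule, the 0.05 threshold, and every function name are assumptions made for illustration only. The sketch assigns a label only when the emoji signal and the lexicon signal agree, and separately shows how a sentence containing a single "but" could be split into its a and b conjuncts.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical emoji polarity table (not from the paper): +1 positive, -1 negative.
EMOJI_POLARITY = {"😀": 1, "😍": 1, "👍": 1, "😡": -1, "😢": -1, "👎": -1}

# Toy word lexicon; a real setup would plug in a VADER-style scorer instead.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "love": 1.0,
               "bad": -1.0, "awful": -1.0, "hate": -1.0}


def toy_lexicon_score(text: str) -> float:
    """Mean polarity of known words, in [-1, 1]; 0.0 if no word is known."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def emoji_label(text: str) -> Optional[int]:
    """Return +1 or -1 if all known emojis in the text agree, else None."""
    polarities = {EMOJI_POLARITY[ch] for ch in text if ch in EMOJI_POLARITY}
    return polarities.pop() if len(polarities) == 1 else None


def auto_label(text: str,
               scorer: Callable[[str], float] = toy_lexicon_score,
               threshold: float = 0.05) -> Optional[int]:
    """Label a tweet only when emoji and lexicon signals agree (assumed rule)."""
    score = scorer(text)
    if score >= threshold:
        lexicon = 1
    elif score <= -threshold:
        lexicon = -1
    else:
        lexicon = None
    emoji = emoji_label(text)
    return emoji if emoji is not None and emoji == lexicon else None


def split_a_but_b(text: str) -> Optional[Tuple[str, str]]:
    """Split a sentence with exactly one 'but' into its (a, b) conjuncts."""
    parts = re.split(r"\bbut\b", text, flags=re.IGNORECASE)
    if len(parts) != 2:
        return None
    a, b = parts[0].strip(" ,"), parts[1].strip(" ,")
    return (a, b) if a and b else None


if __name__ == "__main__":
    print(auto_label("I love this 😍"))   # signals agree
    print(auto_label("I hate this 😍"))   # signals disagree, tweet is skipped
    print(split_a_but_b("The plot was awful, but I love the cast"))
```

Discarding tweets on which the two signals disagree trades coverage for label precision, which is one plausible reading of the quality checks mentioned above; the paper itself should be consulted for the actual procedure.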

Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
October 2024
5705 pages
ISBN: 9798400704369
DOI: 10.1145/3627673
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. deep learning
2. discourse analysis
3. sentiment classification

Qualifiers

• Short-paper

Conference

CIKM '24

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
