DOI: 10.1145/2481492.2481495
Research article

Microblog-genre noise and impact on semantic annotation accuracy

Published: 01 May 2013

Abstract

Using semantic technologies for mining and intelligent information access to microblogs is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Semantic annotation of tweets is typically performed in a pipeline, comprising successive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). Consequently, errors are cumulative, and earlier-stage problems can severely reduce the performance of final stages. This paper presents a characterisation of genre-specific problems at each semantic annotation stage and the impact on subsequent stages. Critically, we evaluate impact on two high-level semantic annotation tasks: named entity detection and disambiguation. Our results demonstrate the importance of making approaches specific to the genre, and indicate a diminishing returns effect that reduces the effectiveness of complex text normalisation.
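The claim that errors are cumulative can be illustrated with a small sketch. It assumes, as a simplification not stated in the abstract, that each stage errs independently, so the fraction of items that survive every stage correctly is roughly the product of the per-stage accuracies. The stage names follow the pipeline order given above; the function name `cumulative_accuracy` and the accuracy values are hypothetical and are not results from the paper.

```python
# Stage names follow the pipeline order given in the abstract.
STAGES = [
    "language identification",
    "tokenisation",
    "part-of-speech tagging",
    "named entity recognition",
    "entity disambiguation",
]


def cumulative_accuracy(stage_accuracy):
    """Return (stage, running product of accuracies) pairs in pipeline order.

    Assumes stage errors are independent, which is a simplification.
    """
    running = 1.0
    result = []
    for stage in STAGES:
        running *= stage_accuracy[stage]
        result.append((stage, running))
    return result


# Hypothetical accuracies for illustration only; not figures from the paper.
example = {stage: 0.9 for stage in STAGES}

for stage, acc in cumulative_accuracy(example):
    print(f"{stage:28s} {acc:.3f}")
```

Under these assumed values, five stages that are each 90% accurate leave only about 59% of items correct at the end of the pipeline, which is why problems in early stages can dominate the quality of the final disambiguation step.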

Published In

HT '13: Proceedings of the 24th ACM Conference on Hypertext and Social Media
May 2013
275 pages
ISBN: 9781450319676
DOI: 10.1145/2481492
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Twitter
  2. entity disambiguation
  3. entity recognition
  4. microblog
  5. semantic annotation
  6. text normalisation

Qualifiers

  • Research-article

Acceptance Rates

HT '13 Paper Acceptance Rate 16 of 96 submissions, 17%;
Overall Acceptance Rate 378 of 1,158 submissions, 33%
