research-article

Normalization of Abbreviation and Acronym on Microtext in Bahasa Indonesia by Using Dictionary-Based and Longest Common Subsequence (LCS)

Published: 01 January 2019

Abstract

Communication today often calls for expressing ideas in short text. Such communication is delivered through various media, such as short message service (SMS), Facebook statuses, Twitter posts, chat messages, comments, and other forms of short text. These kinds of short text are known as microtext. Microtext usually consists of one sentence or less, is written informally, and contains abbreviations, acronyms, emoticons, hashtags, and other features. These features pose a particular challenge to text classification: they cannot be processed directly as in traditional text processing, because doing so may lead to inaccuracy. Microtext normalization is therefore required to transform these features into well-written text before text processing is applied. This research aims to normalize two of these features: abbreviations and acronyms. The normalization applies a dictionary-based approach and a longest common subsequence (LCS) approach to microtext in Bahasa Indonesia. The dictionary-based approach showed excellent performance compared with LCS; however, it is limited to predefined abbreviations and acronyms. LCS, on the other hand, offers dynamic microtext normalization. Combining both approaches increases normalization performance slightly.
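To make the two approaches concrete, here is a minimal sketch of how a dictionary lookup with an LCS fallback might be combined. It is not the authors' implementation: the function names, the sample dictionary and vocabulary entries, the first-letter filter, and the shortest-candidate tie-break are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalize_token(token: str,
                    dictionary: Dict[str, str],
                    vocabulary: Iterable[str]) -> str:
    """Expand an abbreviation: dictionary first, then an LCS-based fallback."""
    key = token.lower()
    if not key:
        return token
    # Step 1 (dictionary-based): exact lookup of a predefined abbreviation.
    if key in dictionary:
        return dictionary[key]
    # Step 2 (LCS fallback, assumed rule): keep vocabulary words that start
    # with the same letter and contain every letter of the token in order.
    candidates = [
        word for word in vocabulary
        if word.startswith(key[0]) and lcs_length(key, word) == len(key)
    ]
    if not candidates:
        return token  # leave unknown tokens unchanged
    # Assumed tie-break: prefer the shortest matching word.
    return min(candidates, key=len)


# Hypothetical sample data, not taken from the paper.
ABBREVIATIONS = {"yg": "yang", "dgn": "dengan"}
VOCABULARY = ["tidak", "tidur", "sudah", "yang", "dengan"]

print(normalize_token("yg", ABBREVIATIONS, VOCABULARY))
print(normalize_token("tdk", ABBREVIATIONS, VOCABULARY))
```

In this sketch the dictionary handles only the entries it was given, while the LCS step can propose an expansion for an unseen abbreviation such as "tdk", which mirrors the trade-off the abstract describes.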

    Published In

    Procedia Computer Science, Volume 161, Issue C
    2019
    1341 pages
    ISSN: 1877-0509
    EISSN: 1877-0509

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands

    Author Tags

    1. microtext normalization
    2. abbreviation
    3. acronym
    4. dictionary-based normalization
    5. longest common subsequence normalization
