research-article

News Article Text Classification in Indonesian Language

Published: 01 November 2017

Abstract

This research aims to identify a suitable algorithm for automatically classifying news articles written in Indonesian. The dataset was collected by web crawling www.cnnindonesia.com. Each document first undergoes text preprocessing, namely lemmatization and stopword removal, to reduce noise. Feature selection is then applied to separate the more important words in the document from the less important ones, after which the document is passed to a classifier. We compare TF-IDF and SVD for feature selection, and Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes, and Support Vector Machine as classifiers. In our tests, the combination of TF-IDF and the Multinomial Naïve Bayes classifier gives the best result among the combinations tested, with a precision of 0.9841519 and a recall of 0.9840000. This outperforms a previous similar study on Indonesian news article classification, which obtained an accuracy of 85%.
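The best-performing combination reported above, TF-IDF weighting followed by a Multinomial Naïve Bayes classifier, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy corpus, the function names (`fit_idf`, `tfidf`, `train_mnb`, `predict`), the smoothed IDF formula, and the Laplace smoothing constant `alpha` are all assumptions chosen for illustration, and the token lists are assumed to be already lemmatized and stripped of stopwords.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (NOT the paper's dataset): (tokens, label) pairs,
# assumed to be already lemmatized and stripped of stopwords.
TRAIN = [
    ("timnas menang laga final".split(), "olahraga"),
    ("pemain cetak gol laga".split(), "olahraga"),
    ("rupiah melemah pasar saham".split(), "ekonomi"),
    ("bank naik suku bunga pasar".split(), "ekonomi"),
]

def fit_idf(docs):
    """Smoothed inverse document frequency for every training term (assumed formula)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf(doc, idf):
    """Map a token list to {term: tf * idf}; terms unseen in training are dropped."""
    tf = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def train_mnb(samples, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""
    class_docs = Counter(label for _, label in samples)
    weights = defaultdict(lambda: defaultdict(float))
    for doc, label in samples:
        for t, w in tfidf(doc, idf).items():
            weights[label][t] += w
    vocab_size = len(idf)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(weights[label].values())
        log_prior = math.log(n_docs / len(samples))
        log_like = {
            t: math.log((weights[label].get(t, 0.0) + alpha) / (total + alpha * vocab_size))
            for t in idf
        }
        model[label] = (log_prior, log_like)
    return model

def predict(doc, idf, model):
    """Return the label with the highest log posterior score."""
    vec = tfidf(doc, idf)

    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())

    return max(model, key=score)

idf = fit_idf([doc for doc, _ in TRAIN])
model = train_mnb(TRAIN, idf)
print(predict("gol final timnas".split(), idf, model))
```

On this toy corpus the query tokens occur only in the "olahraga" (sports) training documents, so that class receives the higher score. A realistic experiment would need a far larger crawled corpus and a held-out test set before precision and recall figures mean anything.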

Cited By

  • (2023) Evaluation of Maestro, an extensible general-purpose data gathering and data classification platform. Information Processing and Management: an International Journal 60:5. 10.1016/j.ipm.2023.103458. Online publication date: 1-Sep-2023
  • (2020) A Benchmark Study on Machine Learning Methods using Several Feature Extraction Techniques for News Genre Detection from Bangla News Articles & Titles. Proceedings of the 7th International Conference on Networking, Systems and Security (25-35). 10.1145/3428363.3428373. Online publication date: 22-Dec-2020
  • (2019) Using Synonym and Definition WordNet Semantic relations for implicit aspect identification in Sentiment Analysis. Proceedings of the 2nd International Conference on Networking, Information Systems & Security (1-5). 10.1145/3320326.3320406. Online publication date: 27-Mar-2019

Published In

Procedia Computer Science  Volume 116, Issue C
November 2017
656 pages
ISSN:1877-0509
EISSN:1877-0509
Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Classification
  2. Feature Selection
  3. Multinomial Naïve Bayes
  4. TF-IDF

Qualifiers

  • Research-article