research-article

Urdu text classification

Authors:

Abbas Raza Ali,

Maliha Ijaz

FIT '09: Proceedings of the 7th International Conference on Frontiers of Information Technology

Article No.: 21, Pages 1 - 7

https://doi.org/10.1145/1838002.1838025

Published: 16 December 2009

Abstract

This paper compares two statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used to train and test the classifiers. Because the classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized, reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured, and they show satisfactory results for the text preprocessing module. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
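To make the Naïve Bayes side of the comparison concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is not the authors' implementation: the abstract does not specify their model variant, features, or corpus, so the function names (`train_naive_bayes`, `classify`), the transliterated toy tokens, and the two class labels are all hypothetical, and the documents are assumed to be already tokenized by a preprocessing step.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Fit a multinomial Naive Bayes model.

    docs: list of (tokens, label) pairs, where tokens is a list of strings
    produced by some upstream preprocessing step (assumed, not shown).
    Returns (log_priors, token_counts, vocab).
    """
    class_docs = Counter()               # number of documents per class
    token_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()                        # lexicon of all training tokens
    for tokens, label in docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    log_priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return log_priors, token_counts, vocab


def classify(tokens, log_priors, token_counts, vocab):
    """Return the class with the highest log-posterior for the given tokens."""
    best_label, best_score = None, float("-inf")
    for label, log_prior in log_priors.items():
        counts = token_counts[label]
        # Add-one smoothing: every lexicon entry gets one pseudo-count.
        denom = sum(counts.values()) + len(vocab)
        score = log_prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus (transliterated placeholders, not the paper's data).
TRAIN = [
    ("cricket match team score".split(), "sports"),
    ("team wins match".split(), "sports"),
    ("minister government budget".split(), "politics"),
    ("government election minister vote".split(), "politics"),
]
MODEL = train_naive_bayes(TRAIN)
```

Calling `classify("match score".split(), *MODEL)` scores each class by summing the log prior and the smoothed log likelihood of every known token. Shrinking the lexicon through preprocessing, as the abstract describes, directly reduces the number of parameters such a model has to estimate.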



Index Terms

Urdu text classification
1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees



Information & Contributors

Information

Published In

FIT '09: Proceedings of the 7th International Conference on Frontiers of Information Technology

December 2009

446 pages

ISBN:9781605586427

DOI:10.1145/1838002

Conference Chair:
Sajjad Ahmad Madani
COMSATS Institute of Information Technology, Pakistan


In-Cooperation

SIGAI: ACM Special Interest Group on Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

FIT '09: 7th International Conference on Frontiers of Information Technology

December 16 - 18, 2009

Abbottabad, Pakistan


Bibliometrics & Citations

Bibliometrics

Total Citations: 40
Total Downloads: 613
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 0

Reflects downloads up to 21 Nov 2024

Citations

Cited By

Abraham A, Kumar Gupta B, Sachindeo Maurya A, Bhushan Verma S, Husain M, Ali A, Alshmrany S, Gupta S (2024). Naïve Bayes Approach for Word Sense Disambiguation System With a Focus on Parts-of-Speech Ambiguity Resolution. IEEE Access 12, 126668-126678. Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3453912

Ashfaq S, Chhajro M, Khan S, Laghari A (2024). Medical assistant chatbot Urdu text sentiment analysis. Human-Intelligent Systems Integration. Online publication date: 22-Nov-2024. https://doi.org/10.1007/s42454-024-00059-3

Mehmood F, Shahzadi R, Ghafoor H, Asim M, Ghani M, Mahmood W, Dengel A (2023). EnML: Multi-label Ensemble Learning for Urdu Text Classification. ACM Transactions on Asian and Low-Resource Language Information Processing 22:9, 1-31. Online publication date: 22-Sep-2023. https://dl.acm.org/doi/10.1145/3616111

Khan S, Nazir S, Khan H (2023). Analysis of Cursive Text Recognition Systems: A Systematic Literature Review. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7, 1-30. Online publication date: 20-Jul-2023. https://dl.acm.org/doi/10.1145/3592600

Zaheer K, Talib M, Hanif M, Sarwar M (2023). A Multi-Kernel Optimized Convolutional Neural Network With Urdu Word Embedding to Detect Fake News. IEEE Access 11, 142371-142382. Online publication date: 2023. https://doi.org/10.1109/ACCESS.2023.3341870

Das Dawn D, Khan A, Shaikh S, Pal R (2023). A 2-Tier Bengali Dataset for Evaluation of Hard and Soft Classification Approaches. IETE Journal of Research 70:3, 2430-2452. Online publication date: 20-Feb-2023. https://doi.org/10.1080/03772063.2023.2173672

Sahu S, Dutta D, Pal S, Rasheed I (2023). Effect of Stopwords and Stemming Techniques in Urdu IR. SN Computer Science 4:5. Online publication date: 29-Jul-2023. https://doi.org/10.1007/s42979-023-01953-4

Khan L, Amjad A, Ashraf N, Chang H (2022). Multi-class sentiment analysis of urdu text using multilingual BERT. Scientific Reports 12:1. Online publication date: 31-Mar-2022. https://doi.org/10.1038/s41598-022-09381-9

Raju G, Badugu S, Akhila V (2022). Telugu Text Classification Using Supervised Machine Learning Algorithm. Smart Intelligent Computing and Applications, Volume 1, 293-305. Online publication date: 19-Apr-2022. https://doi.org/10.1007/978-981-16-9669-5_27

Deshmukh R, Kiwelekar A (2022). Deep Convolutional Neural Network Approach for Classification of Poems. Intelligent Human Computer Interaction, 74-88. Online publication date: 20-Mar-2022. https://doi.org/10.1007/978-3-030-98404-5_7
