research-article

The Contribution of Stemming and Semantics in Arabic Topic Segmentation

Published: 11 January 2018

Abstract

Topic segmentation is a core task in Natural Language Processing, yet it remains under-researched for the Arabic language. This article aims to improve Arabic Topic Segmentation (ATS) by examining two segmenters, ArabC99 and ArabTextTiling. The study is carried out on two independent levels that correspond to the basic steps of topic segmentation: the pre-processing level and the segmentation level. On the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is better suited to this step than the alternatives examined. Building on this result, we proceed to the segmentation level and propose two Arabic segmenters, ArabC99-LS-LSA and ArabTextTiling-LS-LSA, both of which use external semantic knowledge derived from Latent Semantic Analysis (LSA). The evaluation results show that LSA improves ATS. The main outcome of this article is therefore a multilevel improvement of ATS based on Light10 and LSA.
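
The two levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' ArabC99 or ArabTextTiling implementation: `toy_light_stem`, `gap_similarities`, and `boundaries` are hypothetical names, the stemmer applies a single illustrative prefix rule rather than Light10, and boundaries are picked with a plain similarity threshold instead of TextTiling's depth scores. An LSA-based variant would replace the raw term counts with vectors from a latent semantic space, which is not shown here.

```python
import math
from collections import Counter


def toy_light_stem(word):
    """Toy stand-in for a light stemmer: strip a leading definite article.

    Assumption: this is NOT Light10, only one illustrative prefix rule.
    """
    article = "\u0627\u0644"  # the Arabic definite article "al-"
    if word.startswith(article) and len(word) > len(article) + 1:
        return word[len(article):]
    return word


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, stem=lambda w: w, window=2):
    """Pre-process each sentence, then compare the blocks around each gap.

    Gap g lies between sentences[g - 1] and sentences[g]; the result lists
    the similarities for gaps 1 .. len(sentences) - 1 in order.
    """
    bags = [Counter(stem(w) for w in s.split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def boundaries(sims, threshold):
    """Return the gaps whose block similarity falls below the threshold."""
    return [i + 1 for i, s in enumerate(sims) if s < threshold]


sentences = ["cat dog", "dog cat", "sun moon", "moon sun"]
sims = gap_similarities(sentences, stem=toy_light_stem)
print(boundaries(sims, threshold=0.5))  # prints [2]
```

In this toy input the vocabulary changes completely between the second and third sentences, so the only gap that falls below the threshold is gap 2. The `stem` parameter is where a different stemming algorithm would be swapped in to compare its effect on the segmentation.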

Supplementary Material

a12-naili-apndx.pdf (naili.zip)
Supplemental movie, appendix, image, and software files for The Contribution of Stemming and Semantics in Arabic Topic Segmentation

Cited By

  • (2024) Tashaphyne: A Python package for Arabic Light Stemming. Journal of Open Source Software 9:93 (6063). DOI: 10.21105/joss.06063. Online publication date: Jan-2024
  • (2020) ArA*summarizer: An Arabic text summarization system based on subtopic segmentation and using an A* algorithm for reduction. Expert Systems 37:2. DOI: 10.1111/exsy.12476. Online publication date: 15-Jan-2020

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 2
    June 2018
    134 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3160862
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 January 2018
    Accepted: 01 October 2017
    Revised: 01 August 2017
    Received: 01 October 2016
    Published in TALLIP Volume 17, Issue 2

    Author Tags

    1. ArabTextTiling
    2. ArabC99
    3. Arabic stemming algorithms
    4. Arabic topic segmentation

    Qualifiers

    • Research-article
    • Research
    • Refereed
