Printed Text Image Database for Sindhi OCR

Published: 16 May 2016

Abstract

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents as well as their retrieval. Research on most noncursive scripts, such as Latin, has matured, whereas research on cursive (connected) scripts is still developing. Researchers around the world are working on cursive scripts (Arabic and the scripts that adopt it) to overcome the difficulties and challenges of understanding and handling documents in these scripts. Among the languages that adopt the Arabic script, Sindhi has the largest extension of the original Arabic alphabet: it contains 52 characters, compared to 28 in the original Arabic alphabet, in order to accommodate more sounds of the language. There are 24 differentiating characters, some of which possess four dots. Sindhi OCR research and development needs a database of Sindhi text images for training and testing. We have developed a large database containing over 4 billion words and 15 billion characters in 150 fonts, in four font weights and four styles. The database contents were collected from various sources including websites, books, and theses. A custom-built application was also developed to create a text image from a text document; it supports various fonts and sizes. The database considers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, on request by email to one of the authors.
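The abstract names four units that the database considers: words, characters, characters with spaces, and lines. As an illustration only, the following minimal Python sketch counts these units for a piece of text. The function name `text_statistics` and the counting rules (whitespace-delimited words, Unicode code points as characters, blank lines ignored) are assumptions made for this example; the abstract does not say how the authors counted them.

```python
def text_statistics(text: str) -> dict[str, int]:
    """Count lines, words, and characters in a text (hypothetical helper)."""
    # Assumption: blank lines are ignored because they contain no text.
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: a word is any whitespace-delimited token.
    words = text.split()
    # Characters counted as Unicode code points, excluding all whitespace.
    characters = sum(1 for ch in text if not ch.isspace())
    # Characters including spaces inside lines, but excluding line breaks.
    characters_with_spaces = sum(len(line) for line in lines)
    return {
        "lines": len(lines),
        "words": len(words),
        "characters": characters,
        "characters_with_spaces": characters_with_spaces,
    }


if __name__ == "__main__":
    # A short two-line Sindhi sample, used only to exercise the function.
    sample = "سنڌي ٻولي\nسنڌ"
    for unit, count in text_statistics(sample).items():
        print(f"{unit}: {count}")
```

Counting code points is a simplification: a character that carries combining marks spans several code points, so an OCR-oriented count may need grapheme-level segmentation instead.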

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4
June 2016
173 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2915955
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 May 2016
Accepted: 01 November 2015
Revised: 01 November 2015
Received: 01 April 2015
Published in TALLIP Volume 15, Issue 4

Author Tags

  1. Sindhi optical character recognition
  2. Text image database

Qualifiers

  • Note
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 1
Reflects downloads up to 01 Oct 2024

Cited By

  • (2023) Baseline Isolated Printed Text Image Database for Pashto Script Recognition. Intelligent Automation & Soft Computing 37:1 (875-885). DOI: 10.32604/iasc.2023.036426. Online publication date: 2023
  • (2023) A Large-Scale Font-Diverse Sindhi Ligature Recognition System. 2023 International Conference on Frontiers of Information Technology (FIT) (132-137). DOI: 10.1109/FIT60620.2023.00033. Online publication date: 11-Dec-2023
  • (2023) Printed Ottoman text recognition using synthetic data and data augmentation. International Journal on Document Analysis and Recognition 26:3 (273-287). DOI: 10.1007/s10032-023-00436-9. Online publication date: 24-May-2023
  • (2021) Romanized Sindhi Rules for Text Communication. Mehran University Research Journal of Engineering and Technology 40:2 (298-304). DOI: 10.22581/muet1982.2102.04. Online publication date: 1-Apr-2021
  • (2018) Offline-printed Sindhi Optical Text Recognition: Survey. 2018 IEEE 5th International Conference on Engineering Technologies and Applied Sciences (ICETAS) (1-5). DOI: 10.1109/ICETAS.2018.8629110. Online publication date: Nov-2018
  • (2017) Urdu ligature recognition using multi-level agglomerative hierarchical clustering. Cluster Computing 21:1 (503-514). DOI: 10.1007/s10586-017-0916-2. Online publication date: 25-May-2017
