School of Science and Engineering

Department of Electronics Engineering

Egypt

The Arabic Handwritten Digits Databases

ADBase & MADBase

Sherif Abdelazeem, Ezzat El-Sherif

Electronics Engineering Dept., The American University in Cairo

Download Databases

ADBase training set can be downloaded from here.

ADBase testing set can be downloaded from here.

MADBase training set can be downloaded from here.

MADBase testing set can be downloaded from here.

Introduction

This webpage introduces two large Arabic handwritten digit databases suitable for Arabic digit recognition research. The first database is the Arabic Digits dataBase (ADBase), which is composed of 70,000 digit images in bmp format: 60,000 for training and 10,000 for testing. The second database is the Modified ADBase (MADBase), which is a modified version of the ADBase.

The ADBase

The ADBase is composed of 70,000 digits written by 700 participants. Each participant wrote each digit (from 0 to 9) twenty times; only ten of these are used in our database, and the other ten may be used later in writer verification research. To ensure that different writing styles are included, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. Forms were scanned at 300 dpi resolution, and the digits were then automatically extracted, categorized, and enclosed in bounding boxes. We adjusted the scanner to produce binary images directly, so we did not need to binarize the resulting images. Some noisy and corrupted digit images were edited manually. The database is partitioned into two sets: a training set (60,000 digits, 6,000 images per class) and a test set (10,000 digits, 1,000 images per class). The writers of the training set and the test set are mutually exclusive. The order in which writers were assigned to the test set was randomized so that the test set writers do not all come from a single institution (to ensure variability of the test set).
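
The writer-exclusive partition described above can be illustrated with a minimal Python sketch. It is not the authors' actual procedure, which is not given here. The function name split_writers, the integer writer IDs, and the fixed seed are hypothetical. The figure of 100 test writers is derived from the stated counts: each writer contributes 10 digits × 10 samples = 100 images, so 10,000 test images correspond to 100 writers.

```python
import random


def split_writers(writer_ids, n_test_writers, seed=0):
    """Randomly hold out n_test_writers writers for the test set.

    Shuffling the writer order before the split mimics the randomized
    assignment described in the text, so that test writers are unlikely
    to all come from one institution. The seed is an assumption, used
    only to make this sketch reproducible.
    """
    ids = list(writer_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    test = set(ids[:n_test_writers])
    train = set(ids[n_test_writers:])
    return train, test


# 10 digit classes x 10 retained samples per class, per writer.
DIGITS_PER_WRITER = 10 * 10

# Hypothetical writer IDs 0..699 stand in for the 700 participants.
train_writers, test_writers = split_writers(range(700), n_test_writers=100)

print(len(train_writers) * DIGITS_PER_WRITER)  # training images
print(len(test_writers) * DIGITS_PER_WRITER)   # test images
```

Because the split is made over writers rather than over individual images, no writer can appear in both sets.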

The MADBase

One objective of our research was to establish benchmark results for the Arabic digit recognition problem using different classification techniques. Another was to compare the performance of different classification techniques on both Arabic and Latin digit recognition problems. For such a comparison to be valid, the Arabic and Latin digit databases should have the same format. Since we chose MNIST as the Latin digits database, we created a Modified version of the ADBase (MADBase) that has the same size and format as MNIST. The MADBase is created from the ADBase as follows. For each digit of the ADBase, its height (h) and width (w) are calculated, and the digit is then size-normalized to a new height (hnew) and a new width (wnew). The values of hnew and wnew depend on which of h and w is greater. If h>w, then hnew is set to 20 and wnew to floor(20×w/h). If w>h, then wnew is set to 20 and hnew to floor(20×h/w). This procedure ensures that each digit of the MADBase is confined to a 20×20 box while its aspect ratio is preserved. Note that the resulting size-normalized digits have gray levels as a result of the anti-aliasing filter used in the size-normalization procedure. The MADBase now has the same size and format as MNIST. In fact, MNIST is itself a modified version of the NIST digits database, just as the MADBase is a modified version of the ADBase. MNIST can be downloaded from here.
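
The rule for choosing hnew and wnew can be written as a short Python sketch. It only computes the target dimensions; the anti-aliased resampling itself is not shown. The function name normalized_size is hypothetical. Two behaviours are assumptions that the text does not specify: the case h = w is treated like h>w (both formulas give 20×20), and the shorter side is clamped to at least 1 pixel so that very thin digits do not collapse to zero width.

```python
def normalized_size(h, w, box=20):
    """Return (h_new, w_new) so the longer side of the digit equals `box`.

    Integer floor division implements floor(box*w/h) and floor(box*h/w)
    exactly for positive integer h and w.
    """
    if h <= 0 or w <= 0:
        raise ValueError("height and width must be positive")
    if h >= w:
        # h > w case from the text; h == w is handled here as an assumption.
        w_new = max(1, (box * w) // h)  # clamp to 1 is an assumption
        return box, w_new
    # w > h case from the text.
    h_new = max(1, (box * h) // w)  # clamp to 1 is an assumption
    return h_new, box


print(normalized_size(40, 30))  # tall digit: height becomes 20
print(normalized_size(30, 40))  # wide digit: width becomes 20
```

For example, a 40×30 (h×w) digit maps to 20×15, and a 30×40 digit maps to 15×20, so the aspect ratio is preserved up to the rounding introduced by floor.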

Acknowledgement

We are thankful to all who assisted in building this database: Hossam Hassan, Reem Ater, Nagwa Ibrahim, Mohammed Ismail, Mohammed Khairy, Karim Ater, Mohammed Ra'afat, and Mohammed Omran.

Contact

Send your comments, suggestions, or inquiries to Ezzat.