ANLP Book Habash 2010
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00277ED1V01Y201008HLT010
Synthesis Lectures on Human Language Technologies, Lecture #10
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: Print 1947-4040, Electronic 1947-4059
Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of
Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language
processing, computational linguistics, information retrieval, and spoken language understanding.
Emphasis is on important new techniques, on new applications, and on topics that combine two or
more HLT subfields.
Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009
Nizar Y. Habash
Columbia University
Morgan & Claypool Publishers
ABSTRACT
This book provides system developers and researchers in natural language processing and computational
linguistics with the necessary background information for working with the Arabic language.
The goal is to introduce Arabic linguistic phenomena and review the state of the art in Arabic
processing. The book discusses Arabic script, phonology, orthography, morphology, syntax and
semantics, with a final chapter on machine translation issues. The chapter sizes correspond more or less
to what is linguistically distinctive about Arabic, with morphology getting the lion's share, followed
by Arabic script. No previous knowledge of Arabic is needed. This book is designed for computer
scientists and linguists alike. The focus of the book is on Modern Standard Arabic; however, notes
on practical issues related to Arabic dialects and languages written in the Arabic script are presented
in different chapters.
KEYWORDS
Arabic, natural language processing, computational linguistics, script, phonology,
orthography, morphology, syntax, semantics, machine translation
This book is dedicated to the memory of my father,
Sakher Habash, who opened my eyes to the beauty of language
and to the elegance of science.
I owe so much to his love, support and faith in me.
Contents
Preface
Acknowledgments
1 What is Arabic?
  1.1 Arabic Language and Arabic Dialects
  1.2 Arabic Script
  1.3 This Book
2 Arabic Script
  2.1 Elements of the Arabic Script
    2.1.1 Letters
    2.1.2 Diacritics
    2.1.3 Digits
    2.1.4 Punctuation and Other Symbols
    2.1.5 Arabic Script Extensions
    2.1.6 Arabic Typography
  2.2 Arabic Encoding, Input and Display
    2.2.1 Arabic Input/Output Support
    2.2.2 Arabic Encodings
  2.3 NLP Tasks
    2.3.1 Orthographic Transliteration
    2.3.2 Orthographic Normalization
    2.3.3 Handwriting Recognition
    2.3.4 Automatic Diacritization
  2.4 Further Readings
3 Arabic Phonology and Orthography
  3.1 Arabic Phonology
    3.1.1 Basic Concepts
    3.1.2 A Sketch of Arabic Phonology
    3.1.3 Phonological Variations among Arabic Dialects and MSA
  3.2 Arabic Orthography
    3.2.1 Optional Diacritics
    3.2.2 Hamza Spelling
    3.2.3 Morpho-phonemic Spelling
    3.2.4 Standardization Issues
  3.3 NLP Tasks
    3.3.1 Proper Name Transliteration
    3.3.2 Spelling Correction
    3.3.3 Speech Recognition and Synthesis
  3.4 Further Readings
4 Arabic Morphology
  4.1 Basic Concepts
    4.1.1 Form-Based Morphology
    4.1.2 Functional Morphology
    4.1.3 Form-Function Independence
  4.2 A Sketch of Arabic Word Morphology
    4.2.1 Cliticization Morphology
    4.2.2 Inflectional Morphology
    4.2.3 Derivational Morphology
    4.2.4 Morphophonemic and Orthographic Adjustments
  4.3 Further Readings
6 Arabic Syntax
  6.1 A Sketch of Arabic Syntactic Structures
    6.1.1 A Note on Morphology and Syntax
    6.1.2 Sentence Structure
    6.1.3 Nominal Phrase Structure
    6.1.4 Prepositional Phrases
  6.2 Arabic Treebanks
    6.2.1 The Penn Arabic Treebank
    6.2.2 The Prague Arabic Dependency Treebank
    6.2.3 Columbia Arabic Treebank
    6.2.4 Comparison: PATB, PADT and CATiB
    6.2.5 A Forest of Treebanks
  6.3 Syntactic Applications
  6.4 Further Readings
Bibliography
Nizar Y. Habash
New York, August 2010
Acknowledgments
I would like to thank Owen Rambow, Mona Diab, Tim Buckwalter, Kareem Darwish, Otakar
Smrž, Mohamed Maamouri, Ann Bies, Seth Kulick, Alon Lavie, Ryan Roth, Yassine Benajiba,
Kristen Parton, Marine Carpuat, Mohamed Eltantawy, Sarah Alkuhlani, Ahmed El Kholy, Fadi
Biadsy, Wael Abd-Almageed, Katrin Kirchhoff, Bonnie Dorr, Amy Weinberg, Mary El-Kadi, John
P. Broderick, Janet Bing, and Robert Fradkin for helpful discussions and feedback. I would also like
to thank Warren Churchill for his invaluable support and encouragement during the writing of this
book.
Finally, I would like to thank all the researchers working on Arabic NLP and all the funding
agencies that have supported research on Arabic NLP across the world.
Nizar Y. Habash
New York, August 2010
CHAPTER 1
What is Arabic?
In the context of this book, the label Arabic is used to refer to a language, a collection of dialects and
a script.
CHAPTER 2
Arabic Script
In this chapter, we discuss the Arabic script primarily as used to write Modern Standard
Arabic (MSA). We start with a linguistic description of Arabic script elements and follow it with a
discussion of computer encodings and text input and display. We also discuss common practices in
NLP for handling peculiarities of the Arabic script and briefly introduce four script-related com-
putational tasks: orthographic transliteration, orthographic normalization, handwriting recognition
and automatic diacritization. The transliteration used for romanizing the Arabic script is discussed
in Section 2.3.1.
2.1.1 LETTERS
Arabic letters are written in cursive style in both print and script (handwriting). They typically consist
of two parts: letter form (رسم rasm) and letter mark (إعجام ǍiςjAm). The letter form is an essential
component in every letter. There is a total of 19 letter forms. See Figure 2.1. The letter marks, also
called consonantal diacritics, can be sub-classified into three types. See Figure 2.2. First are dots,
also called points, of which there are five: one, two or three to go above the letter form and one or
two to go below the letter form. Second is the short Kaf, which is used to mark specific letter shapes
of the letter Kaf (see Figure 2.4). Third is the Hamza (همزة hamzaħ) letter mark. The Hamza can
appear above or below specific letter forms. The term Hamza is used for both the letter form (ء) and
the letter mark, which appears with other letter forms such as أ Â, ؤ ŵ, and ئ ŷ. The Madda letter
mark (مدة madaħ) is a Hamza variant.1
Figure 2.1: Letter forms are the basic graphic backbones of Arabic letters.
Figure 2.2: Letter marks are necessary to distinguish different letters. The figure features five dots/points,
the short Kaf, three Hamzas and the Madda.
Specific combinations of letter forms and letter marks result in the 36 letters of the Arabic
alphabet used to write MSA (see Table 2.1 at the end of this chapter). Some letters are created
using letter forms only with no letter marks. Letter marks typically distinguish letters with different
consonantal phonetic mappings although not always. See Figure 2.3. We discuss the question of
sound-to-letter mapping in the next chapter.
Figure 2.3: Letter dots in the first and second clusters from the right create letters with distinct con-
sonantal phonetic values. All the letters in the first cluster from the left are used for marking the glottal
stop in different vocalic and graphic contexts.
1 An additional less common letter mark related to Hamza is the Wasla, which only appears with the Alif letter form in
Alif-Wasla/Hamzat-Wasl: ٱ. This letter is so uncommon it is not part of some encodings of Arabic. We return to discuss this
briefly in Section 3.2.1.
Terminology Alert Letter marks, specifically dots, should not be confused with Hebrew Niqqud
dots, which are optional diacritics comparable to Arabic diacritics (Section 2.1.2). Arabic dots and
other letter marks are all obligatory. That said, among researchers in optical character recognition
(OCR), the term diacritic (i.e., consonantal diacritic) is often used to mean letter mark, not diacritic in
the sense used in this book and by most researchers in NLP. The Hamza letter mark in stem-initial
positions tends to be perceived as a diacritic as opposed to stem-medial and stem-final positions
[5]. See Section 2.1.2.
Letter Shapes
Arabic letters have different shapes depending on their position in a word: initial, medial, final or
stand-alone. The letter shapes are used in both print and script, with no distinction. The letter
shapes are also called allographs, and the letters graphemes, by analogy to allophones and phonemes
(Section 3.1.1). Similarly, the context-based selection of letter shape is called graphotactics, by analogy
to phonotactics. The terminology used in font and encoding design is different: letters are characters
and shapes are glyphs (Section 2.2). The initial and medial shapes are typically similar and so are the
final and stand-alone shapes. Most letter forms are written fully connected. However, a few letter
forms are post-disconnective; they connect to preceding letters but not to following letters. All letter
shapes following a post-disconnective letter form are either initial or stand-alone. One letter form,
the Hamza (ء), is fully disconnective. See Figure 2.4.
Associated with disconnective letters are small white spaces that follow the letter creating
visually isolated islands of connected letters, called word parts. In the example in Figure 2.5, there
are two words and five word parts. These spaces make it harder for OCR systems to identify the
boundary of a word correctly. The spaces also can lead to spelling errors that may not stand out
visually: words split into word parts or multiple words attached with no real space. To some extent,
this problem of identifying which word parts make up a word is similar to the Chinese word
segmentation problem where a word is made up of one or more characters which can be words on
their own [6].
Figure 2.5: Arabic words are mostly connected but may contain small spaces from disconnective letters.
Figure 2.6: Putting it all together: from letters to words. The two words exemplified here are كتب ktb
and كتاب ktAb. Read from right to left. Short vowel diacritics are not shown.
Figure 2.6 shows how a word is constructed by putting all of its letters together. Remember
that Arabic is written from right to left when you match up the transliterations with the letters. The
letter Alif (A in green) is a disconnective letter, and as such, it breaks the second word into two word
parts.
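The connection behavior described in this section can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the set of post-disconnective letters is filled in from general knowledge of the script (Alif and its Hamzated and Madda variants, Dal, Thal, Ra, Zay, Waw, Hamzated Waw, Ta-Marbuta and Alif-Maqsura) rather than read off Figure 2.4, the names `word_parts` and `letter_shapes` are hypothetical, and the input is assumed to be an undiacritized word in logical order.

```python
# Assumed inventory of letters that connect to a preceding letter but not to a following one.
POST_DISCONNECTIVE = set(
    "\u0627\u0623\u0625\u0622"  # Alif, Alif with Hamza above/below, Alif with Madda
    "\u062F\u0630\u0631\u0632"  # Dal, Thal, Ra, Zay
    "\u0648\u0624"              # Waw, Waw with Hamza
    "\u0629\u0649"              # Ta-Marbuta, Alif-Maqsura
)
HAMZA = "\u0621"  # Hamza-on-the-line: connects to neither neighbor


def word_parts(word):
    """Split a word into its visually connected islands (word parts)."""
    parts, current = [], ""
    for ch in word:
        if ch == HAMZA:
            if current:
                parts.append(current)
                current = ""
            parts.append(ch)
            continue
        current += ch
        if ch in POST_DISCONNECTIVE:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def letter_shapes(word):
    """Label each letter with the shape its position in a word part calls for."""
    shapes = []
    for part in word_parts(word):
        if len(part) == 1:
            shapes.append((part, "stand-alone"))
            continue
        shapes.append((part[0], "initial"))
        shapes.extend((ch, "medial") for ch in part[1:-1])
        shapes.append((part[-1], "final"))
    return shapes
```

Applied to the two words of Figure 2.6, the sketch yields one word part for ktb and two for ktAb, with the last letter of ktAb taking its stand-alone shape because it follows the Alif.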
Although, in principle, the letter shape is tied to the letter form component, some letters,
such as the Ta-Marbuta (ة ħ) and Alif-Maqsura (ى ý),2 share only some of the letter shapes of their
letter forms and are post-disconnective even though their letter forms are not. Moreover, some letter
shapes, as in initial and medial Kaf, lose the letter mark which appears in the final and stand-alone
shapes (see Figures 2.4 and 2.6).
2 There are numerous possible romanizations for Arabic names, including Arabic letter names. See Section 3.3.1. In this book, we
try to be internally consistent, but readers should be aware that they will encounter variant spellings, e.g., the Unicode standard
names we display for reference in Table 2.1.
Ligatures
In addition to the shape variations, Arabic has a large set of common ligatures, different
representations of two or even three letters. Ligatures typically involve vertical positioning
of letters (Figure 2.7) and vary by font (Figure 2.13). All ligatures are optional and font
dependent except for the Lam-Alif ligature, which is obligatory: ل + ا is represented as لا (or
medially as ـلا), not لـا. This post-disconnective ligature has three variants that include Hamzated
Alifs and Alif with Madda: لأ, لإ and لآ. Ligatures pose an added challenge to encoding Arabic.
We discuss this in Section 2.2.
Figure 2.7: Example of two optional ligatures in one font but not another. The second and third letters
and the last two letters in the bottom example forge vertical ligatures but not in the top example.
1. The basic 28 letters. The basic letters of the Arabic alphabet correspond to Arabic's 28
consonantal sounds. They are constructed using all letter forms except for the Hamza letter
form. In all of these letters, the letter marks are fully discriminative, distinguishing different
consonants from each other.
2. The Hamza letters. There are six. One is the Hamza-on-the-line (ء), which is made of the
Hamza letter form. The rest use the Hamza and Madda letter marks with other letter forms:
آ, أ, ؤ, إ and ئ. When a Hamzated Alif (أ, إ, آ) follows a Lam, the obligatory Lam-Alif ligature
takes on the letter mark too. These letters are typically not listed as part of the alphabet. The
Hamza letters all represent one consonant: the glottal stop (see Section 3.1.2). The different
Hamza forms are governed by a set of complex spelling rules that reflect vocalic context and
neighboring letter forms [7, 8].
3. The Ta-Marbuta. This letter is a special morphological marker typically marking a feminine
ending. The Ta-Marbuta (ة ħ), literally 'tied Ta', is a hybrid letter merging the form of the
letters Ha (ه h) and Ta (ت t). Ta-Marbuta only appears in word-final positions. When the
morpheme it represents is in word-medial position, it is written using the letter Ta (ت t). For
example, مكتبة+هم mktbħ+hm 'library+their' is written as مكتبتهم mktbthm 'their library'.
Although the letter form of the Ta-Marbuta is fully connective, the Ta-Marbuta letter is
post-disconnective.
4. The Alif-Maqsura. This letter is also a special morphological marker marking a range of
morphological information from feminine endings to underlying word roots. The Alif-Maqsura
(ى ý), literally 'shortened Alif', is a hybrid letter merging the forms of the letters Alif (ا A)
and Ya (ي y). The Alif-Maqsura only appears in word-final positions as a dotless Ya. When
the morpheme it represents is in word-medial position, it is written using the letters Alif (ا
A) or Ya (ي y). For example, مستشفى+هم mstšfý+hm 'hospital+their' is written مستشفاهم
mstšfAhm 'their hospital'; however, إلى+هم Ǎlý+hm 'to+them' is written إليهم Ǎlyhm 'to them'.
Although the letter form of the Alif-Maqsura is fully connective, the Alif-Maqsura letter is
post-disconnective.
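The word-medial spelling of Ta-Marbuta and Alif-Maqsura described in items 3 and 4 can be sketched as a small rewrite step applied when a suffix is attached. The Python below is a minimal illustration, not a full spelling model: the function name `attach_suffix` and the exception set `MAQSURA_TO_YA` are hypothetical, and the exception set lists only the one example given in the text (the word glossed 'to'). Deciding between Alif and Ya in general requires lexical knowledge that this sketch does not have.

```python
TA_MARBUTA, TA = "\u0629", "\u062A"
ALIF_MAQSURA, ALIF, YA = "\u0649", "\u0627", "\u064A"

# Hypothetical exception list: stems whose Alif-Maqsura becomes Ya before a suffix.
# Only the example from the text is included; a real system would need a lexicon.
MAQSURA_TO_YA = {"\u0625\u0644\u0649"}


def attach_suffix(stem: str, suffix: str) -> str:
    """Attach a suffix, respelling a stem-final Ta-Marbuta or Alif-Maqsura."""
    if stem.endswith(TA_MARBUTA):
        # Ta-Marbuta is word-final only; medially it is written as Ta.
        stem = stem[:-1] + TA
    elif stem.endswith(ALIF_MAQSURA):
        # Alif-Maqsura is word-final only; medially it is written as Alif or Ya.
        stem = stem[:-1] + (YA if stem in MAQSURA_TO_YA else ALIF)
    return stem + suffix
```

With the stems and suffix from the examples above, the function reproduces the three written forms mktbthm, mstšfAhm and Ǎlyhm.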
There are a few additional letters that are not officially part of the Arabic script for MSA. Most
commonly seen are پ p, چ c, ڤ v and گ g. These are borrowings from other languages typically
used to represent sounds not in MSA or the local dialect. See Section 2.1.5.
2.1.2 DIACRITICS
The second class of symbols in the Arabic script is the diacritics. Whereas letters are always
written, diacritics are optional: written Arabic can be fully diacritized, partially diacritized, or entirely
undiacritized. The NLP task of restoring diacritics, or simply diacritization, is briefly introduced in
Section 2.3.4. Typically, Arabic text is undiacritized except in religious texts, children's educational
texts, and some poetry. Some diacritics are indicated in modern written Arabic to help readers
disambiguate certain words. In the Penn Arabic Treebank (part 3) [9], 1.6% of all words have at least
one diacritic indicated by their author. Of these, 99.3% are correct in the sense that they appear in
the correct position in the word.
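A statistic of the kind quoted above (the share of words carrying at least one author-supplied diacritic) can be computed with a few lines of Python. This is a sketch under assumptions, not the procedure used for the treebank figure: the diacritic inventory is taken to be the Unicode combining marks U+064B through U+0652 plus the Dagger Alif U+0670, the input is assumed to be already tokenized, and the function names are hypothetical.

```python
# Assumed diacritic inventory: Fathatan..Sukun (U+064B-U+0652) and Dagger Alif (U+0670).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def has_diacritic(word: str) -> bool:
    """True if the word carries at least one diacritic from the assumed inventory."""
    return any(ch in DIACRITICS for ch in word)


def diacritized_share(words) -> float:
    """Fraction of words with at least one diacritic (0.0 for an empty input)."""
    words = list(words)
    if not words:
        return 0.0
    return sum(has_diacritic(w) for w in words) / len(words)
```

For a two-word list in which only one word carries a Damma, `diacritized_share` returns 0.5.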
There are three types of diacritics: Vowel, Nunation, and Shadda. They are presented in
Figure 2.8. Vowel diacritics represent Arabic's three short vowels (Fatha /a/, Damma /u/ and Kasra
/i/) and the absence of any vowel (no vowel, Sukun). Nunation diacritics can only occur in
word-final positions in nominals (nouns, adjectives and adverbs), where they indicate indefiniteness (see
Section 4.2.2). They are pronounced as a short vowel followed by an unwritten /n/ sound. For
example, بٌ bũ is pronounced /bun/. The Nunation diacritics look like a doubled version of their
corresponding short vowels and are named in Arabic as such: Fathatan, Dammatan, Kasratan [lit.
two Fathas, two Dammas, two Kasras, respectively]. This is simply an orthographic accident and has
no linguistic significance. Shadda is a consonant doubling diacritic: بّ b~ (/bb/). Shadda typically
combines with a vowel or Nunation diacritic: بُّ b~u (/bbu/) or بٌّ b~ũ (/bbun/). For example, the
word عَبَّرَ ςab~ara 'he expressed' is pronounced /ʕabbara/. More details on Arabic pronunciation
are presented in Chapter 3. Figure 2.9 shows an example of fully diacritized words.
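The way the three diacritic types combine on a single consonant can be illustrated with a short Python sketch. It is deliberately simplified and rests on assumptions: the consonant is passed in as a Latin stand-in such as "b", the diacritics are passed as a string of Unicode combining marks, and the function name `expand` is hypothetical. It covers only the pronunciations described above (short vowel, vowel plus /n/ for Nunation, doubling for Shadda).

```python
VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0652": ""}  # Fatha, Damma, Kasra, Sukun
NUNATION = {"\u064B": "an", "\u064C": "un", "\u064D": "in"}           # Fathatan, Dammatan, Kasratan
SHADDA = "\u0651"


def expand(consonant: str, diacritics: str) -> str:
    """Spell out the sounds of one consonant plus its diacritics."""
    # Shadda doubles the consonant regardless of where it sits among the marks.
    sound = consonant * (2 if SHADDA in diacritics else 1)
    for mark in diacritics:
        if mark in VOWELS:
            sound += VOWELS[mark]
        elif mark in NUNATION:
            sound += NUNATION[mark]
    return sound
```

The four examples in the paragraph above come out as "bun", "bb", "bbu" and "bbun" respectively.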
One other less commonly used diacritic is the Dagger Alif (الألف الخنجرية), aka small Alif
or superscript Alif, which is a diacritic representing a long /a/ vowel (/ā/). It appears in the archaic
spelling of a few words, e.g., اللّٰه Allh 'Allah' and هٰذا hðA 'this'.
Quranic spelling makes use of a variety of additional diacritics as a guide to the reading of the
Quran. We will not discuss Quranic Arabic here as it is a specialized form of Arabic that is rather
different from MSA [10].
2.1.3 DIGITS
Arabic numbers are written in a decimal system. There are two sets of digits used for writing numbers
in the Arab World. The Arabic Numerals commonly used in Europe, the Americas and most of the rest
of the world are only used in Western Arab countries (Morocco, Algeria, Tunisia). Middle Eastern
Arab countries (e.g., Egypt, Syria, Iraq, Saudi Arabia) use what are called Indo-Arabic numerals. Some
non-Arab countries, such as Iran and Pakistan, use a variant of the Indo-Arabic numeral set, which
differs in the forms of digits 4, 5 and 6 only. The three digit sets are contrasted in Figure 2.10.
Figure 2.11: Arabic digits are typed from left-to-right in right-to-left text.
Although Arabic is written from right to left, the forms of multi-digit numbers in Arabic are
the same as those used in European (left-to-right) languages. In typing, multi-digit numbers
are keyed from left to right. See Figure 2.11. In handwriting, two-digit numbers are written
right-to-left, but larger numbers start on the left and head rightward. This is a reflection of how
numbers are commonly uttered in Arabic: in smaller numbers (up to 100), the smaller place-value
digit is uttered (and written) first, but in larger numbers, the highest place-value is uttered first. For
example, a number such as ٢,٣٤٥ 2,345 is uttered as 'two-thousand three-hundred five and forty'.3
Mapping between digit and utterance is important for applications such as text-to-speech and also
language modeling for Automatic Speech Recognition (ASR) [14].
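The utterance order described above can be sketched as a function that lists a number's place-value components in the order they are spoken. This is a simplified Python illustration: the function name `spoken_place_values` is hypothetical, the range is restricted to 1 through 9,999, and it models only the ordering of place values (units before tens, after the higher places), not the actual Arabic number words or their agreement rules.

```python
def spoken_place_values(n: int) -> list[int]:
    """Return the non-zero place-value components of n in spoken order."""
    if not 0 < n < 10_000:
        raise ValueError("this sketch only covers 1 through 9,999")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, units = divmod(rest, 10)
    # Highest places first, but the units are uttered before the tens.
    ordered = [thousands * 1000, hundreds * 100, units, tens * 10]
    return [value for value in ordered if value]
```

For 2,345 this gives the components 2000, 300, 5, 40, matching the utterance in the example above.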
Figure 2.12: Some of the additional letter marks not used in MSA are presented below. In the graph on
the bottom, the inner circle contains MSA Arabic letters, which are used in all extended variants. The
middle circle marked with a dotted border contains letters that infrequently appear in MSA as borrowed
symbols. The outer circle contains non-MSA letters.
FAQ: What are the most prominent differences between the Arabic and Roman scripts from
the point of view of NLP?
Some of the differences, such as script direction, letter-shaping and obligatory ligatures, are effec-
tively abstracted away in computational applications (see Section 2.2) and, as such, are rendered
irrelevant. The two most prominent differences are perhaps optionality of diacritics and lack of
capitalization. Diacritics, or precisely the fact that they are almost never written, put a bigger load
on human readers in a way that is much harder for machines to model compared to Roman-script
languages. We discuss Arabic morphological disambiguation in Chapter 5. The lack of capital/small
letter distinction, which is used in specific ways in different Roman-script languages, makes some
applications, such as named entity recognition and part-of-speech tagging, more challenging in Arabic.
Figure 2.13: Examples of various Arabic fonts in print, handwriting, graffiti and calligraphy.
Figure 2.14: An example of what a string of Arabic text looks like in memory and on the screen. In
memory, letters are ordered logically (first-to-last). For display purposes, a basic algorithm for reorienting
and shaping the text is used. However, special handling of multiple directions for Arabic digits and
non-Arabic letters can make this a complex task.
In the 1990s, several now-obsolete solutions were created to bypass the lack of universal
support for Arabic input and display. The basic idea was to encode Arabic allographically by not
only assigning different symbol codes to different letter shapes and ligatures but also internally
encoding Arabic in visual order. Visual order refers to encoding Arabic backward so that it appears
from right to left when displayed by a system expecting a left-to-right script [17]. These encodings
suffered from many problems and limitations. It is worth pointing out that visual encoding of digits
can still be found in some texts and is a problem for NLP.4
4 Try for example googling "سنة 1999" snħ 1999 'year 1999' and compare it to googling "سنة 9991".
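The digit problem with visual order can be made concrete with a small Python sketch. It is an illustration under assumptions, not a converter for any particular legacy encoding: the function name `visual_to_logical` is hypothetical, the input is assumed to be a single line stored in visual (left-to-right screen) order with its digit runs in the correct reading order, and shape codes and embedded Roman-script words are not handled.

```python
import re


def visual_to_logical(line: str) -> str:
    """Reverse a visually ordered line while keeping digit runs readable."""
    # A plain reversal restores the letter order but flips every number.
    reversed_line = line[::-1]
    # Re-reverse each run of digits so that, e.g., 9991 becomes 1999 again.
    return re.sub(r"\d+", lambda match: match.group(0)[::-1], reversed_line)
```

Skipping the second step is what produces strings like the flipped year in the footnote above: the letters come out in logical order but the number reads 9991.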
Figure 2.15: The standard Arabic keyboard layout for PCs. Mac machines have a slightly different
keyboard layout. The keyboard is based on letters, although it also contains Lam-Alif keys which have
the effect of striking Lam then Alif. The person typing strikes the keys in logical order, and the text is
displayed from right-to-left with correct shapes and ligatures.
Before computers, typewriters and print press machines had different type bars (the equivalent
of symbols) for different ligatures and letter shapes. Typing on a typewriter required specifying the
correct letter shape to produce. This is a more complex version of specifying capitals and small letters
in Roman script. The encoding choice used with modern computers provides a major simplification
to Arabic data entry via keyboards. Arabic is simply entered graphemically in logical order. See
Figure 2.15.
We discuss next some of the most commonly used encodings for Arabic.
Figure 2.16: Comparing the correct and incorrect decoding of various Arabic encodings.
Unicode
Unicode is the current de facto standard for encoding a large number of languages and scripts
simultaneously. Unicode was originally designed to use two bytes of information (to code 65,536
unique symbols) and has been expanded since to cover over 1 million unique symbols. For Arabic,
Unicode supports an extended Arabic character set. It also gives Arabic letter shapes and ligatures
unique addresses under what it calls Presentation Forms A and B charts.5 Because Unicode encodes
so many more characters than ISO-8859-6 and CP-1256, conversion from these encodings into
Unicode is possible, but the reverse may be lossy. Although Unicode provides an important solution
to representing the extended Arabic script set, it introduces new challenges. In particular, it introduces
multiple ways to represent the same looking symbol. For instance, the Indo-Arabic and Eastern
Indo-Arabic numbers are all replicated. Similarly, some letters have shapes that may not be distinguished
easily, e.g., ك (U+0643) Arabic k and ک (U+06A9) Persian k, which initially have a similar shape: كـ.
This confusion will typically arise when Arabic is typed on a Persian keyboard. Finally, the presence
of presentation form charts allows incorrect allographic encoding that may not be easily detectable
visually. All of these cases make it hard to match strings of text that on the screen look identical
although they are encoded differently.
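One way to cope with such look-alike strings is to compare them through a folded key rather than directly. The Python sketch below is an assumption-laden illustration, not a standard procedure: the function name `canonical_key` is hypothetical, Unicode NFKC normalization from the standard `unicodedata` module is used to fold presentation-form code points back to their base letters, the Persian Kaf is mapped to the Arabic Kaf by an explicit rule, and every decimal digit is replaced by its ASCII value. Whether these foldings are appropriate depends on the application, for example they would be wrong for text that is actually Persian.

```python
import unicodedata

ARABIC_KAF = "\u0643"
PERSIAN_KAF = "\u06A9"


def canonical_key(text: str) -> str:
    """Fold look-alike encodings into one form for string matching."""
    # NFKC replaces presentation forms (letter shapes, ligatures) with base letters.
    folded = unicodedata.normalize("NFKC", text)
    out = []
    for ch in folded:
        value = unicodedata.decimal(ch, None)
        if value is not None:
            out.append(str(value))      # any decimal digit set -> ASCII digit
        elif ch == PERSIAN_KAF:
            out.append(ARABIC_KAF)      # explicit look-alike rule (an assumption)
        else:
            out.append(ch)
    return "".join(out)
```

Two strings can then be matched with `canonical_key(a) == canonical_key(b)` even when one was typed on a Persian keyboard or stored with presentation-form code points.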
Terminology Alert The term transliteration is used by many researchers to mean any kind of
mapping from one script to another regardless of the type of mapping. This may include any type
of transcription (strict or ad hoc) and may lead to multiple valid mappings. The most common
variant of this is the task of Proper Name Transliteration, which explores the ways names are
represented in different scripts [21, 22]. We discuss this task in Section 3.3.1.
Arabic characters for computers, e.g., Unicode. The Buckwalter transliteration has been used in many
NLP publications and in resources developed at the Linguistic Data Consortium (LDC). The main
advantages of the Buckwalter transliteration are that it is a strict transliteration (i.e., one-to-one)
and that it is written in ASCII characters, i.e., easily reproducible without special fonts.
One of the common critiques of the Buckwalter transliteration is that it is not easy to read. In
this book, we use the more intuitive Habash-Soudi-Buckwalter transliteration (HSB) [4] variant of
the Buckwalter transliteration. A second critique of the Buckwalter transliteration is that it contains
some characters that are reserved symbols in different computer programming languages such as
Perl or C and representations such as XML, e.g., the curly brackets and dollar sign. To address
this issue, several safe Buckwalter variants emerged, but they are not standardized. Since there are
many variants of this encoding that are needed for different settings, special care is needed not to
mix them up. Finally, the Buckwalter transliteration is also criticized for being monolingual since
ASCII symbols cannot be used to represent English if they are used for Arabic. Some researchers
address this by special markers to escape non-Arabic characters before converting into Buckwalter
from a standard encoding.
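The one-to-one property described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the table lists the commonly cited Buckwalter symbols for the basic letters and diacritics, and the function names `to_buckwalter` and `from_buckwalter` are hypothetical.

```python
# Buckwalter symbols for the basic Arabic letters and diacritics.
# Keys are Unicode code points written as escapes to avoid display issues.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064A": "y", "\u064B": "F", "\u064C": "N", "\u064D": "K",
    "\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0651": "~",
    "\u0652": "o", "\u0670": "`", "\u0671": "{",
}

# A strict transliteration is invertible: no two letters share a symbol.
REVERSE = {symbol: letter for letter, symbol in BUCKWALTER.items()}
assert len(REVERSE) == len(BUCKWALTER)


def to_buckwalter(text):
    """Map Arabic script to Buckwalter; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Map Buckwalter back to Arabic script; unknown characters pass through."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


word = "\u0643\u062A\u0627\u0628"          # the word for 'book'
print(to_buckwalter(word))                  # ktAb
print(from_buckwalter("ktAb") == word)      # True
```

Because every ASCII symbol in the table is claimed by an Arabic letter, calling `from_buckwalter` on English text would silently turn it into Arabic script, which is the monolingual limitation noted above.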
Encoding cleanup: Arabic encoding, in Unicode in particular, brings many challenges result-
ing from the many ways one can achieve the same displayed sequence of text with different
underlying characters. First, there are multiple ways to encode seemingly similar characters,
such as the codes for Indo-Arabic and Eastern Indo-Arabic digits and various related forms of
Arabic letters, e.g., Arabic and Persian Kafs (ك/ک k). Many similar looking punctuation marks
appear in different charts under Unicode too. Cleaning up the encoding involves normalizing
the variant symbols into a single form. Second, Arabic presentation forms can be encoded
directly, which results in letter and letter shape ambiguity that cannot be detected easily on the
screen. Complex ligature shapes also add to the problem. Proper normalization would convert
these allographic characters into their graphemic form.
Tatweel removal: The Tatweel symbol is simply removed from the text.
Diacritic removal: Since diacritics occur so infrequently, they are considered noise by most
researchers and are simply removed from the text.
Letter normalization: There are four letters in Arabic that are so often misspelled using
variants that researchers find it more helpful to completely make these variants ambiguous
(normalized). The following are the four letters in order of most commonly normalized to
least commonly normalized (the first two are what most researchers do by default, the last two
are less commonly applied).
1. The Hamzated forms of Alif (آ Ā, أ Â, إ Ǎ) are normalized to bare Alif (ا A).
2. The Alif-Maqsura (ى ý) is normalized to a Ya (ي y). In Egypt, but not necessarily in other Arab
countries, a final Ya is often written dotless (i.e., as an Alif-Maqsura). However,
more recently, the exact opposite can be seen: all Alif-Maqsuras are written as dotted
Yas.6
3. The Ta-Marbuta (ة ħ) is normalized to a Ha (ه h).
4. The non-Alif forms of Hamza (ؤ ŵ and ئ ŷ) are normalized to the Hamza letter (ء ').
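The four normalization steps above (encoding cleanup, Tatweel removal, diacritic removal and letter normalization) can be chained as in the following sketch. It rests on several assumptions that are not from the text: Unicode NFKC normalization is used to fold presentation forms into base letters, Eastern Indo-Arabic digits are mapped to ASCII digits, the Persian Kaf is mapped to the Arabic Kaf, and the function name `normalize` and its flags are hypothetical. The two less common letter normalizations are off by default.

```python
import re
import unicodedata

TATWEEL = "\u0640"
# Nunation, short vowels, Shadda, Sukun (U+064B..U+0652) and Dagger Alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Variant symbols folded into a single form (the direction is an assumption).
VARIANTS = {0x06A9: "\u0643"}                              # Persian Kaf -> Arabic Kaf
VARIANTS.update({0x0660 + i: str(i) for i in range(10)})   # Eastern digits -> 0-9

ALIF_FORMS = {0x0622: "\u0627", 0x0623: "\u0627", 0x0625: "\u0627"}
ALIF_MAQSURA = {0x0649: "\u064A"}
TA_MARBUTA = {0x0629: "\u0647"}
HAMZA_FORMS = {0x0624: "\u0621", 0x0626: "\u0621"}


def normalize(text, ta_marbuta=False, hamza=False):
    """Apply the cleanup steps in order; rarer steps are opt-in."""
    text = unicodedata.normalize("NFKC", text)   # presentation forms -> base letters
    text = text.translate(VARIANTS)              # encoding cleanup
    text = text.replace(TATWEEL, "")             # Tatweel removal
    text = DIACRITICS.sub("", text)              # diacritic removal
    text = text.translate(ALIF_FORMS)            # 1. Hamzated Alifs -> bare Alif
    text = text.translate(ALIF_MAQSURA)          # 2. Alif-Maqsura -> Ya
    if ta_marbuta:
        text = text.translate(TA_MARBUTA)        # 3. Ta-Marbuta -> Ha
    if hamza:
        text = text.translate(HAMZA_FORMS)       # 4. Waw/Ya Hamza -> Hamza
    return text


# A diacritized name with Tatweel and a Hamzated Alif reduces to four bare letters.
raw = "\u0623\u064E\u062D\u0652\u0645\u064E\u0640\u0640\u062F"
print(normalize(raw) == "\u0627\u062D\u0645\u062F")   # True
```

Since each step discards information, a pipeline like this is typically applied to a copy of the text used for matching, while the original string is kept for display.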
2.3.3 HANDWRITING RECOGNITION
Handwriting Recognition (HR) is the task of converting handwritten or printed input text into
digital text. HR can be classified into offline HR and online HR. In offline HR, the input is typically
a digital image of written text obtained using either scanning or camera photocopying. Online HR
refers to the task of automatic recognition of text input as a sequence of two-dimensional points (as
with using a digital pen or stylus). Optical Character Recognition (OCR) is sometimes distinguished
from offline HR in referring to printed text (as opposed to manually written text). That said, some
researchers use the terms HR (specifically offline) and OCR interchangeably.
HR of handwritten Arabic is still an area of active research, in both offline and online modes,
due to the innate difficulties of the task [24, 25, 26]. Printed Arabic OCR, where the uniformity of
letter shape and other factors allow for easier recognition, is currently of less interest [25]. The Arabic
script has several properties that make recognition, particularly of handwritten Arabic, challenging
[27, 28, 25]. These properties include the cursive connected nature of the script complicated with
the existence of disconnective letters, the use of floating letter marks and diacritics (which often shift
horizontally in writing) and the use of vertical ligatures and Tatweel.
The connected script and the use of ligatures make it rather difficult for a machine to distin-
guish between individual characters. This is certainly not a property unique to Arabic; methods, such
as Hidden Markov Models, developed for other cursive script languages can be applied successfully
to Arabic [29, 30, 25, 31]. While Arabic disconnective letters may make it hard to determine word
boundaries, they could plausibly contribute to reduced ambiguity of otherwise similar shapes. The
floating nature of letter marks and diacritics poses different problems for online and offline HR. In
offline HR, trace amounts of dust or dirt on the original document scan can be easily mistaken
for these symbols [27]. Alternatively, these symbols may be too small, light or closely-spaced to be
readily distinguished, causing the system to drop them entirely. For online HR, letter marks and
diacritics, also called delayed strokes, have to be paired with the appropriate letter forms for correct
recognition [24].
The DARPA-funded MADCAT7 program, which targets machine translation of OCRed
handwritten Arabic text, has led to the creation of many resources for training and evaluating
Arabic HR [32]. The National Institute of Standards and Technology (NIST) has a public version
of MADCAT's evaluation competition named Open HaRT (Open Handwriting Recognition and
Transcription). For additional resources, see also Appendix C.
6 See the page of the Egyptian daily newspaper Al-Ahram. It is hard to call such a spelling convention a spelling error given its
relative regularity (at least within the same text).
7 MADCAT stands for Multilingual Automatic Document Classification Analysis and Translation.
2.3.4 AUTOMATIC DIACRITIZATION
Diacritization, also called diacritic restoration, vocalization, vowelization and vowel restoration, is
the process of recovering missing diacritics (short vowels, nunation, the marker of the absence of
a short vowel, and the gemination marker). Diacritization is closely related to morphosyntactic
disambiguation and to lemmatization (Section 5.1) since some of the diacritics vary depending on
syntactic conditions (such as case-related diacritics) and some vary to indicate semantic differences.
The choice of the diacritic on the last written letter of the word (without the possessive
or object clitic which may be attached) is particularly hard since it requires syntactic information:
in imperfective verbs, this diacritic often expresses mood, and in nouns and adjectives, it expresses
syntactic case. Thus, it is common to define a simpler diacritization task which does not choose
the word-final diacritic.
Much work has been done on Arabic automatic diacritization using a wide range of techniques
[33, 11, 34, 12, 35].
Letter  Unicode name        HSB  Buckwalter  XML-safe  Safe  CP1256  ISO-8859-6  Unicode
ء       Hamza               '    '           '         C     C1      C1          0621
آ       Alef Madda Above    Ā    |           |         M     C2      C2          0622
أ       Alef Hamza Above    Â    >           O         O     C3      C3          0623
ؤ       Waw Hamza Above     ŵ    &           W         W     C4      C4          0624
إ       Alef Hamza Below    Ǎ    <           I         I     C5      C5          0625
ئ       Yeh Hamza Above     ŷ    }           }         Q     C6      C6          0626
ا       Alef                A    A           A         A     C7      C7          0627
ب       Beh                 b    b           b         b     C8      C8          0628
ة       Teh Marbuta         ħ    p           p         p     C9      C9          0629
ت       Teh                 t    t           t         t     CA      CA          062A
ث       Theh                θ    v           v         v     CB      CB          062B
ج       Jeem                j    j           j         j     CC      CC          062C
ح       Hah                 H    H           H         H     CD      CD          062D
خ       Khah                x    x           x         x     CE      CE          062E
د       Dal                 d    d           d         d     CF      CF          062F
ذ       Thal                ð    *           *         V     D0      D0          0630
ر       Reh                 r    r           r         r     D1      D1          0631
ز       Zain                z    z           z         z     D2      D2          0632
س       Seen                s    s           s         s     D3      D3          0633
ش       Sheen               š    $           $         c     D4      D4          0634
ص       Sad                 S    S           S         S     D5      D5          0635
ض       Dad                 D    D           D         D     D6      D6          0636
ط       Tah                 T    T           T         T     D8      D7          0637
ظ       Zah                 Ď    Z           Z         Z     D9      D8          0638
ع       Ain                 ς    E           E         E     DA      D9          0639
غ       Ghain               γ    g           g         g     DB      DA          063A
ف       Feh                 f    f           f         f     DD      E1          0641
ق       Qaf                 q    q           q         q     DE      E2          0642
ك       Kaf                 k    k           k         k     DF      E3          0643
ل       Lam                 l    l           l         l     E1      E4          0644
م       Meem                m    m           m         m     E3      E5          0645
ن       Noon                n    n           n         n     E4      E6          0646
ه       Heh                 h    h           h         h     E5      E7          0647
و       Waw                 w    w           w         w     E6      E8          0648
ى       Alef Maksura        ý    Y           Y         Y     EC      E9          0649
ي       Yeh                 y    y           y         y     ED      EA          064A
CHAPTER 3
1 We follow the common practice of using /.../ to indicate phonemic sequences and [...] phonetic sequences. We use the HSB
transcription [4] with some extensions instead of the International Phonetic Alphabet (IPA) to minimize the number of repre-
sentations used in this book.
28 3. ARABIC PHONOLOGY AND ORTHOGRAPHY
Terminology Alert: The use of terms like phoneme, phone and phonotactics in NLP areas such
as Automatic Speech Recognition (ASR) may not be completely consistent with how linguists
use them. For example, instead of explicit linguistic rules, phonotactics may just refer to n-gram
sequences of phones/phonemes.
Figure 3.1: Arabic consonantal phonemic inventory. Rows represent the different manners of articulation,
while columns represent the different places of articulation. Pairs of phonemes are plain and emphatic
variants. Phonemes in gray are non-MSA (dialectal).
Figure 3.2: Arabic vocalic phonemic inventory. Vowels are represented in terms of height and backness
of the position of the tongue. Phonemes in gray are non-MSA (dialectal).
MSA's basic phonological profile includes 28 consonants, three short vowels and three long
vowels. In addition, MSA has two diphthongs: /ay/ and /aw/. Figures 3.1 and 3.2 present the various
consonantal and vocalic (respectively) phonemes in MSA in terms of their articulatory features (in
3.1. ARABIC PHONOLOGY 29
consultation with [38, 39]). In Figure 3.1, the presence of a pair of phonemes in one cell, as in t T,
indicates that they are plain and emphatic, respectively. Emphasis (تفخيم tafxiym) is a bass effect
giving an acoustic impression of hollow resonance to the basic sound [38]. Emphasis together with
the presence of eight consonants in the velar and post-velar region is what gives Arabic pronunciation
its distinctive guttural quality [38]. Vowel phoneme pairs in Figure 3.2 indicate length difference
(short and long). The phonemes in gray in Figures 3.1 and 3.2 are not MSA, i.e., they are dialectal.
More on this in Section 3.1.3. All of Arabic's consonants have direct comparables in English with
the following exceptions:2
/H/ sounds like an h with a hissing quality that can be approximated with the sound made
when breathing on eyeglasses before wiping them clean.
The glottal stop (Hamza) /'/ sounds like the English phone in the middle of uh-oh.
The emphatics /D/, /T/, /S/ and /Ď/ have a bass quality added to their plain counterparts (/d/,
/t/, /s/, /ð/, respectively).
Notice that the phonemes /š/ and /θ/ correspond to the same phonemes in English often written
with the two-letter combinations sh and th, respectively.
MSA vowel phonemes are limited in number compared to English or French; however, there
are many allophones to each of them depending on the consonantal context [38]. For instance,
contrast the pronunciation of the vowel /ā/ in باس /bās/ 'he kissed' and باص /bāS/ 'bus', which can
be approximated by the English words bass [the fish] and boss, respectively. This phenomenon is
called emphasis spread. It is a common phonotactic where vowels and consonants near an emphatic
consonant become expressed as their emphatic allophones. Another interesting phenomenon in
MSA vowel pronunciation is the optionality of dropping the final vowel marking syntactic case in
words at the end of utterances (as in the end of a sentence or in citation). This is called waqf (وقف)
[lit. 'stopping/pause'] pronunciation.
There are numerous additional phonological variations that are limited to specific morpho-
logical contexts, i.e., they are constrained morpho-phonemically as opposed to phonologically. Some
of these phenomena are explicitly expressed in the orthography and some are not. We call cases that
are expressed orthographically morphological adjustments and discuss them in the next chapter. For
example, the phoneme /t/ in verbal pattern VIII becomes voiced and is spelled (not just pronounced)
as a د d when adjacent to specific root consonants. On the other hand, we call cases that are not
expressed orthographically, such as the phonological assimilation of the Arabic definite article +ال Al+
to some phonemes that follow it, morpho-phonemic spelling. We discuss these cases in Section 3.2.3.
For a detailed discussion of MSA phonology, stress and syllabic structure, see [39, 38].
2 We do not include additional emphatic phonemes that appear in a limited number of minimal pairs (such as emphatic /l/ and
emphatic /b/) or phonemes in borrowed words from foreign languages (such as /p/ and /v/).
The MSA alveolar affricate ج /j/ is realized as /g/ in EGY, as /ž/ in LEV and as /y/
in GLF. For example, جميل 'handsome' is pronounced /jamīl/ (MSA, IRQ), /gamīl/ (EGY),
/žamīl/ (LEV) and /yamīl/ (GLF). The Levantine and Egyptian pronunciations are considered
standard MSA in those regions.
The MSA consonant ق /q/ is realized as a glottal stop /'/ in EGY and LEV and as /g/ in
GLF and IRQ. For example, طريق 'road' appears as /Tarīq/ (MSA), /Tarī'/ (EGY and LEV)
and /Tarīg/ (GLF and IRQ). Other variants are also found in some sub-dialects such as /k/
in rural Palestinian (LEV), /j/ in Emirati (GLF) and /Q/ (voiced /q/) in Sudanese (EGY).
These changes do not apply to modern and religious borrowings from MSA. For instance,
قرآن 'Quran' is never pronounced anything but /qur'ān/.
The MSA consonant ك /k/ is generally realized as /k/ in Arabic dialects with the exception
of GLF, IRQ and the Palestinian rural sub-dialect of LEV, which allow a /c/ pronunciation
in certain contexts. For example, سمك 'fish' is /samak/ in MSA, EGY and most of LEV but
/simac/ in IRQ and GLF.
The MSA consonant ث /θ/ is pronounced as /t/ in LEV and EGY (or /s/ in more recent
borrowings from MSA), e.g., ثلاثة 'three' is pronounced /θalāθa/ in MSA versus /talāta/ in
EGY.
The MSA consonant ذ /ð/ is pronounced as /d/ in LEV and EGY (or /z/ in more recent
borrowings from MSA), e.g., هذا 'this' is pronounced /hāðā/ in MSA versus /hāda/ (LEV)
and /da/ (EGY).
The MSA consonants ض /D/ (emphatic d) and ظ /Ď/ (emphatic /ð/) are both normalized to
/D/ in EGY and LEV and to /Ď/ in GLF and IRQ. For example, ظل يضرب 'he continued to
hit' is pronounced /Ďalla yaDrubu/ in MSA versus /Dall yuDrub/ (LEV) and /Ďall yuĎrub/
(GLF). In modern borrowings from MSA, /D/ is pronounced /Z/ (emphatic z) in EGY and
LEV. For instance, ضابط 'police officer' is /DābiT/ in MSA but /ZābiT/ in EGY and LEV.
3.2. ARABIC ORTHOGRAPHY 31
MSA has 34 phonemes (28 consonants, 3 long vowels and 3 short vowels). The Arabic script
has 36 letters and 9 diacritics (including the Dagger Alif ). Most Arabic letters have a one-to-one
mapping to an MSA phoneme (Figure 3.3). However, there are some common important exceptions
[4, 42], which we summarize next.
The three short-vowel diacritics, ـَ a, ـُ u, and ـِ i, represent the vowels /a/, /u/ and /i/, respectively.
The short vowel diacritics ـُ u and ـِ i are used together with the glide consonant letters و
w and ي y to denote the long vowels /ū/ (as uw) and /ī/ (iy). The long vowel /ā/ is most
commonly written as a combination of the short-vowel diacritic ـَ a and the letter ا A.3 This
makes these three letters ambiguous.
The three nunation diacritics ـً ã, ـٌ ũ and ـٍ ĩ represent a combination of a short vowel and the
nominal indefiniteness marker /n/ in MSA: /an/, /un/ and /in/, respectively.
The long-vowel diacritic, Dagger Alif ـٰ á, represents the long vowel /ā/ in a small number of
words.
Arabic diacritics can only appear after a letter. As such, word-initial short vowels are
represented with an extra silent Alif, also called Alif-Wasla or Hamzat-Wasl, ٱ Ä (often simply
written as ا A). Sentence/utterance initial Hamzat-Wasl is pronounced like a glottal stop preceding
the short vowel; however, the sentence medial Hamzat-Wasl is silent. For example, اِنكَتَبَ كِتابٌ
Ainkataba kitAbũ 'a book was written' is pronounced /'inkataba kitābun/ but كِتابٌ اِنكَتَبَ
kitAbũ Ainkataba 'a book was written' is pronounced /kitābun inkataba/. A real Hamza is always pronounced
as a glottal stop. The Hamzat-Wasl appears most commonly as the Alif of the definite article ال Al. It
also appears in specific words and word classes such as relative pronouns, e.g., الذي Alðy 'who' and verbs
in Form VII (see Chapter 4).
The most problematic aspect of diacritics is their optionality. This is not so much of a problem
when mapping from phonology to script, but it is in the other direction. Diacritics are largely
restricted to religious texts and Arabic language school textbooks. In other texts, around 1.5% of
the words contain a diacritic. Some diacritics are lexical (where word meaning varies) and others
are inflectional (where nominal case or verbal mood varies). Inflectional diacritics are typically word
final. Since nominal case, verbal mood and nunation have all disappeared in spoken dialectal Arabic,
Arabic speakers do not always produce these inflections correctly or even at all. Notable exceptions
are frequent formulaic expressions such as السلام عليكم AlslAm ςlykm 'Hello' ([lit.] 'Peace be upon
you') /assalāmu ςalaykum/.
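A figure such as the share of words carrying at least one diacritic can be measured directly on a corpus. The sketch below is one simple way to do so, assuming whitespace tokenization and counting the nunation, short-vowel, Shadda, Sukun and Dagger Alif code points as diacritics; the function name `diacritized_ratio` is hypothetical.

```python
import re

# Nunation, short vowels, Shadda, Sukun (U+064B..U+0652) and Dagger Alif (U+0670).
DIACRITIC = re.compile("[\u064B-\u0652\u0670]")


def diacritized_ratio(text):
    """Return the fraction of whitespace-separated words with any diacritic."""
    words = text.split()
    if not words:
        return 0.0
    marked = sum(1 for word in words if DIACRITIC.search(word))
    return marked / len(words)


# Three words, of which only the second carries a diacritic (a final Damma).
sample = "\u0643\u062A\u0628 \u0627\u0644\u0648\u0644\u062F\u064F \u0627\u0644\u062F\u0631\u0633"
print(diacritized_ratio(sample))
```

Punctuation attached to words does not affect the count, since only the presence of a diacritic inside a token is tested.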
the following word meaning 'his glory' when its case marker changes: بهاءه bahA'ahu /bahā'ahu/
(accusative), بهاؤه bahAŵuhu /bahā'uhu/ (nominative), and بهائه bahAŷihi /bahā'ihi/ (genitive).
Hamza spelling is further complicated by the fact that Arabic writers often replace hamzated
letters with the un-hamzated form, e.g., أ Â → ا A, or through two-letter spelling, e.g., ئ ŷ →
ىء ý'. These common variations do not always add ambiguity, especially when they are stem-initial:
أول/اول Âwl/Awl 'first'. When they add ambiguity, typically in stem-medial and stem-final positions,
they are often avoided: بدا bdA 'he appeared' and بدأ bdÂ 'he started'. It has been observed that Hamzas
in stem-initial Hamzated Alifs are typically perceived by Arab writers as diacritical and optional
compared to stem-medial and stem-final cases, which are more often than not considered obligatory [5].
Definite Article The Arabic definite article is a proclitic that assimilates to the first consonant
in the noun or adjective it modifies if this consonant is an alveolar, dental or inter-dental
phoneme (except for /j/).6 This set of 14 consonants is called the Sun Letters. They are ت t, ث
θ, د d, ذ ð, ر r, ز z, س s, ش š, ص S, ض D, ط T, ظ Ď, ل l, and ن n. For example, the word الشمس
Al+šams 'the sun' is pronounced /aššams/ not */alšams/.7 The rest of the consonants are called
the Moon Letters; the definite article is not assimilated with them. For example, the word القمر
Al+qamar 'the moon' is pronounced /alqamar/ not */aqqamar/. Arabic spelling rules require
the addition of a Shadda diacritic on the Sun letter to indicate assimilation without deleting
the assimilating l, e.g., الشّمس Alš~ams.
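The Sun and Moon Letter rule lends itself to a short sketch. The code below assumes single-character HSB-style symbols for the pronunciation (so š, θ, ð and Ď each count as one consonant) and an undiacritized Arabic-script noun for the spelling; the function names `pronounce_definite` and `spell_definite` are hypothetical, and the sketch ignores nouns that begin with a vowel or Hamzat-Wasl.

```python
# The 14 Sun Letters in HSB-style symbols and in Arabic script.
SUN_PHONEMES = set("tθdðrzsšSDTĎln")
SUN_LETTERS = set("\u062A\u062B\u062F\u0630\u0631\u0632\u0633"
                  "\u0634\u0635\u0636\u0637\u0638\u0644\u0646")
SHADDA = "\u0651"
ARTICLE = "\u0627\u0644"   # Alif + Lam


def pronounce_definite(noun):
    """Prefix the article, assimilating /l/ to a following Sun Letter."""
    first = noun[0]
    if first in SUN_PHONEMES:
        return "a" + first + noun      # /l/ is replaced by a copy of the consonant
    return "al" + noun                 # Moon Letters (and /j/) keep the /l/


def spell_definite(noun):
    """Spell the article; the Lam stays and a Shadda marks the Sun Letter."""
    if noun[0] in SUN_LETTERS:
        return ARTICLE + noun[0] + SHADDA + noun[1:]
    return ARTICLE + noun


print(pronounce_definite("šams"))    # aššams
print(pronounce_definite("qamar"))   # alqamar
```

Note the asymmetry the two functions make explicit: the pronunciation drops the /l/, while the spelling keeps the Lam and only adds the Shadda.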
5 The unstressed word-final vowel in عصا ςaSaA 'a stick' is shortened.
6 Another classification is that all of these consonants are coronal, i.e., articulated with the flexible front part of the mouth [39].
The exceptionality of /j/ is often attributed to that phoneme's likely pre-Classical Arabic pronunciation as a (non-coronal) palatal
[38] or voiced velar stop (/g/) [39]. The situation in dialectal Arabic is similar to MSA, although with some differences [39].
7 The star symbol (*) preceding an example is a linguistic marker that indicates the example is incorrect. This star has nothing to
do with the Kleene star used in regular expressions, the Buckwalter transliteration of the letter ذ ð, or the star used to mark the
selected in-context analysis in the Penn Arabic Treebank.
Nunation The indefiniteness morpheme spelling with diacritics is another example of
morpho-phonemic spelling that is already mentioned in the discussion of diacritics above.
Silent Letters A silent Alif appears in the morpheme وا+ +uwA /ū/ (واو الجماعة wAw
AljamAςaħ), which indicates a masculine plural conjugation in verbs. Another silent Alif
appears word finally with some nunated nouns (before or after the diacritic), e.g., كِتَاباً
kitaAbAã or كِتَابًا kitaAbãA /kitāban/. In some poetic readings, this Alif can be produced as the long
vowel /ā/: /kitābā/. Finally, a common odd spelling is that of the proper name عمرو ςamrw
/ςamr/ 'Amr' where the final و w is silent.
Arabic consonants with no exact match in the Roman script such as /H/ and /ς/;
English consonants foreign to Arabic such as /p/ (approximated as ب /b/) and /v/
(approximated as ف /f/);
Names in English from other European languages also written in Roman script bringing
their own particular orthographic and phonological challenges, such as French pronunciations
which drop final consonants and the many ways to write the same phoneme, e.g., /š/: sh, sch
and ch among others.
There are many instantiations of the proper name transliteration problem. Here are a few examples.
The Qaddafi problem refers to cases where one spelling in Arabic corresponds to many spellings
in English. Whereas the Libyan leader's name is spelled قذّافي qað~Afiy in Arabic, there are
numerous English spellings: Qadafi, Qaddafi, Gaddafi, Kaddafi, Kadafy, etc.
The Schwarzenegger problem refers to cases where one spelling in English corresponds to
multiple spellings in Arabic. Here, the single correct spelling for the California governor can appear
in Arabic as شوارزنيغر šwArznyγr, شوارزنيجر šwArznyjr, شوارزنغر šwArznγr, or شوارتزنجر
šwArtznjr, among others. A variant on this problem is the Mozart case, where a couple of
spellings that preserve particular pronunciations appear in Arabic: موزارت mwzArt (Anglophonic)
and موزار mwzAr (Francophonic).
The Hassan problem refers to cases where distinct spellings in Arabic of different names collapse
in English. The name Hassan can be a transliteration of حسن Hasan /Hasan/ or حسّان Has~An
/Hassān/. The ambiguity added here is a result of the lack of a method to indicate gemination in
English spelling, especially when s-doubling is used to force an /s/ pronunciation (as opposed to
/z/). A more complex example is the name Salem, which can be an Anglophonic transliteration
of the Arabic name سالم sAlim /sālim/ or a Francophonic transliteration of the Arabic name
سلام salAm /salem/ (as pronounced in Tunisia).
The Mary/Mari/Marie problem refers to cases where distinct spellings in Roman script are
collapsed in Arabic. The three names Mary, Mari, and Marie often appear in Arabic spelled
as ماري mAry. This can also happen to some Arabic names, whose spelling is ambiguous, e.g.,
Salim, Seleem and Slim are three Roman script spellings of the historically same Arabic name
سليم slym influenced by how it is commonly pronounced in the Levant, Egypt and Morocco,
respectively. In a way, this is related to the Qaddafi problem except in that the various Roman
spellings here are distinctive in their reference to different individuals.
The Urshalim/Alquds problem refers to cases that do not have a phonetic match or whose
phonetic similarity is partial. For example, the Arabic name for Jerusalem is القدس Alquds. The
Hebraic name for the city in Arabic, أورشليم Âwršlym, bears more resemblance to Jerusalem
since the English name comes from Hebrew.
It is important to remember that errors of name transliteration can have major consequences
on the lives of the name bearers, e.g., by unjustly confusing them with suspected individuals. The
problem of proper name transliteration has received a lot of attention in NLP and has been addressed
in a wide range of solutions [21, 43, 44, 45, 22].
8TRANSTAC stands for Spoken Language Communication and Translation System for Tactical Use.
CHAPTER 4
Arabic Morphology
Morphology is central in working on Arabic NLP because of its important interactions with both
orthography and syntax. Arabic's rich morphology is perhaps the most studied and written about
aspect of Arabic. As a result, there is a wealth of terminology, some of it inconsistent, that may
intimidate and confuse new researchers. In this chapter, we start with a review of different terms used
in discussing Arabic morphology issues. This is followed by a brief sketch of Arabic morphology.
The next chapter discusses a few important computational problems of Arabic morphology and
reviews their solutions.
Concatenative Morphology
There are three types of concatenative morphemes: stems, affixes and clitics. At the core of con-
catenative morphology is the stem, which is necessary for every word. Affixes attach to the stem.
There are three types of affixes: (a.) prefixes attach before the stem, e.g., +ن n+ 'first person plural of
imperfective verbs'; (b.) suffixes attach after the stem, e.g., ون+ +wn 'nominative definite masculine
sound plural'; and (c.) circumfixes surround the stem, e.g., ت++ين t++yn second person feminine
1 Our classification is influenced by [67], who distinguishes between illusory (our form-based) and functional morphology. The
additional classification of functional morphology into logical and formal is not discussed explicitly in this book although the
phenomena they address are presented.
[Figure: classification of Morphology; labels: Form-based, Functional, Morphemic]
Terminology Alert The terms prefix and suffix are sometimes used to refer to proclitics and
enclitics, respectively. Prefix and suffix have also been used to refer to the whole sequence of
affixes and clitics attaching to a stem, e.g., in the databases of the Buckwalter Arabic Mor-
phological Analyzer (BAMA) [23], which treats Arabic words as containing three components:
prefix+stem+suffix. For instance, the example above would be broken up as such in BAMA:
وسي+كتب+ونها wasaya+ktub+uwnhA. The stem-initial vowel in this example is considered part
of the prefix ya+ in BAMA. This highlights the problem with stem definition, which can be ad hoc
and implementation dependent.
The stem can be templatic or non-templatic. Templatic stems are stems that can be formed using
templatic morphemes (next section), whereas non-templatic word stems (NTWS) are not
derivable from templatic morphemes. NTWSes tend to be foreign names and borrowed nominal terms
(but never verbs), e.g., واشنطن wAšinTun 'Washington'. NTWS can take nominal affixational and
cliticization morphemes, e.g., والواشنطنيون wa+Al+wAšinTun+iy+uwna 'and the Washingtonians'.
Templatic Morphology
Templatic morphemes come in three types that are equally needed to create a word templatic stem:
roots, patterns and vocalisms. The root morpheme is a sequence of (mostly) three, (less so) four, or
very rarely five consonants (termed radicals).2 The root signifies some abstract meaning shared by all
2 Roots are classified based on the number of their radicals into triliteral (three radicals), quadriliteral (four radicals) and quintiliteral
(five radicals) roots. Some researchers posit that triliteral and other roots were created from biconsonantal roots called etymons
(an earlier form of a word in an ancestor language) [68].
Figure 4.2: Morphological representations of Arabic words. This figure compares different ways of
representing Arabic words morphologically. Row 1 shows three ambiguous undiacritized words. Row 2
shows two disambiguated diacritized readings for each word in Row 1 (among others). Rows 3 and 4
show allomorphs (stems, affixes and clitics) and morphemes (root, pattern, affixes and clitics), respectively.
Row 5 shows the Lexeme and (some of the) feature-value pairs. n/a means a feature is not applicable
(for the particular part-of-speech of the lexeme). Row 6 contains an English gloss for reference.
Row  (a)                    (b)                    (c)                    (d)                    (e)                    (f)
1    وكتبه wktbh (a, b); كاتبته kAtbth (c, d); للكتاب llktAb (e, f)
2    wakatabahu             wakutubihi             kAtabathu              kAtibatuhu             lilkitAbi              lilkut~Abi
3    wa+katab+a+hu          wa+kutub+i+hi          kAtab+at+hu            kAtib+at+u+hu          li+l+kitAb+i           li+l+kut~Ab+i
4    wa+[ktb,1a2a3]+a+hu    wa+[ktb,1u2u3]+i+hu    [ktb,1A2a3]+at+hu      [ktb,1A2i3]+aħ+u+hu    li+Al+[ktb,1i2A3]+i    li+Al+[ktb,1u22A3]+i
5    |katab| Verb           |kitAb| Noun           |kAtab| Verb           |kAtib| Noun           |kitAb| Noun           |kAtib| Noun
     conjunction:wa         conjunction:wa         conjunction:           conjunction:           conjunction:           conjunction:
     particle:              particle:              particle:              particle:              particle:li            particle:li
     article:n/a            article:               article:n/a            article:               article:Al             article:Al
     person:3rd             person:n/a             person:3rd             person:n/a             person:n/a             person:n/a
     gender:masc            gender:masc            gender:fem             gender:fem             gender:masc            gender:masc
     number:sing            number:plur            number:sing            number:sing            number:sing            number:plur
     case:n/a               case:gen               case:n/a               case:nom               case:gen               case:gen
     aspect:perfect         aspect:n/a             aspect:perfect         aspect:n/a             aspect:n/a             aspect:n/a
     object:3MS             possessive:3MS         object:3MS             possessive:3MS         possessive:            possessive:
6    and he wrote it        and his books          she corresponded       his writer [female]    for the book           for the writers
                            [genitive]             with him
its derivations. For example, the words كتب katab 'to write', كاتب kAtib 'writer', مكتوب maktuwb
'written' and all the words in Figure 4.2 share the root morpheme ك-ت-ب k-t-b 'writing-related'.
For this reason, roots are used traditionally for organizing dictionaries and thesauri. That said, root
semantics is often idiosyncratic. For example, the words لحم laHm 'meat', لحم laHam 'to solder', لحّام
laH~Am 'butcher/solderer' and ملحمة malHamaħ 'epic/fierce battle/massacre' are all said to have
the same root ل-ح-م l-H-m whose meaning is left to the reader to imagine.
Not all consonantal combinations are possible in a root. For instance, no roots where all radicals
are copies of the same consonant are allowed, e.g., ب-ب-ب b-b-b. Second and third radicals can
be identical, or geminate, e.g., ر-د-د r-d-d 'repeating-related', but they cannot be homo-organic,
i.e., produced in the same articulation point [38]. Some roots have one or two weak radicals (the
consonants ي y or و w). For example, و-ز-ن w-z-n 'measure-related', ق-و-ل q-w-l 'voice-related',
or ر-م-ي r-m-y 'throwing-related'. Middle-weak roots are called hollow roots. Final weak roots are
called defective roots.
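The root constraints and weak-root classes just described can be checked mechanically. The following is a minimal sketch over hyphen-separated transliterated roots; the function name `classify_root` and the label strings are hypothetical, and the homo-organic constraint is not implemented because it needs articulation data that the text does not list.

```python
WEAK_RADICALS = {"w", "y"}


def classify_root(root):
    """Return descriptive labels for a root such as 'q-w-l'."""
    radicals = root.split("-")
    if not 3 <= len(radicals) <= 5:
        raise ValueError("roots have three to five radicals: " + root)
    if len(set(radicals)) == 1:
        raise ValueError("all radicals are the same consonant: " + root)
    labels = []
    if len(radicals) == 3 and radicals[1] == radicals[2]:
        labels.append("geminate")      # identical second and third radicals
    if any(r in WEAK_RADICALS for r in radicals):
        labels.append("weak")          # contains w or y
    if len(radicals) == 3 and radicals[1] in WEAK_RADICALS:
        labels.append("hollow")        # middle-weak
    if radicals[-1] in WEAK_RADICALS:
        labels.append("defective")     # final-weak
    return labels


for root in ["k-t-b", "r-d-d", "w-z-n", "q-w-l", "r-m-y"]:
    print(root, classify_root(root))
```

A root can carry several labels at once, so the function returns a list rather than a single class.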
4.1. BASIC CONCEPTS 43
Terminology Alert Roots are bound morphemes, i.e., they cannot appear on their own unlike words.
They are also not pronounceable unlike words and stems. However, the notion of root in Arabic
is sometimes confused by researchers with the notions of word and stem for a variety of reasons.
One reason is that the notion of root in English and other European languages (all non-templatic)
is closer to that of stem in Arabic. Researchers of Arab background make a similar error possibly
because of Arabic orthography's optional diacritics which cause some undiacritized words to look
like a sequence of root radicals, e.g., the word كتب ktb [kataba] 'he wrote' and its root ك-ت-ب k-t-b.
Although Arabic speakers recognize that roots are not vocalized, they are often pronounced with
the stock pattern 1a2a3a, which, being identical to a verb template, adds to the confusion. Finally,
the closeness of the Arabic terms for root جذر (jaðr) and stem جذع (jiðς) may also contribute to
this confusion.
The pattern morpheme is an abstract template in which roots and vocalisms are inserted. We
represent the pattern as a string of letters including special symbols to mark where root radicals
and vocalisms are inserted. We use the numbers 1, 2, 3, 4, or 5 to indicate radical position3 and the
symbol V is used to indicate the position of the vocalism. For example, the pattern 1V22V3 indicates
that the second root radical is to be doubled. A pattern can include letters for additional consonants
and vowels, e.g., the verbal pattern V1tV2V3.
The vocalism morpheme specifies the short vowels to use with a pattern. Traditional accounts
of Arabic morphology collapse the vocalism into the pattern [69]. The separation of vocalisms
was introduced with the emergence of more sophisticated models that abstract certain inflectional
features that consistently vary across complex patterns, such as voice (passive versus active) [70].
Terminology Alert There are many terms used to refer to the concept template. In addition to
template and pattern, researchers may encounter wazn (from Arabic grammar), binyan (from Hebrew
grammar), Form and Measure. The term pattern is used ambiguously to include or exclude vocalisms,
i.e., vocalism-specified pattern and vocalism-free pattern.
A word stem is constructed by interleaving (aka interdigitating) the three types of templatic
morphemes. For example, the word stem كتب katab 'to write' is constructed from the root ك-ت-ب
k-t-b, the pattern 1V2V3 and the vocalism a-a.
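The interleaving step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interdigitate`, the hyphen-separated input format for roots and vocalisms, and the one-character-per-radical transliteration are all choices made here, not part of the text.

```python
def interdigitate(root, pattern, vocalism):
    """Fill a pattern such as '1V2V3' with root radicals and vocalism vowels.

    Digits in the pattern pick a root radical (1-based), 'V' consumes the
    next vocalism vowel, and any other symbol is copied through unchanged.
    """
    radicals = root.split("-")          # 'k-t-b' -> ['k', 't', 'b']
    vowels = iter(vocalism.split("-"))  # 'a-a'   -> 'a', then 'a'
    out = []
    for symbol in pattern:
        if symbol.isdigit():
            out.append(radicals[int(symbol) - 1])
        elif symbol == "V":
            out.append(next(vowels))
        else:
            out.append(symbol)
    return "".join(out)


print(interdigitate("k-t-b", "1V2V3", "a-a"))    # stem from the example above
print(interdigitate("k-t-b", "V1tV2V3", "i-a-a"))  # pattern with extra consonant t
```

Keeping the vocalism as a separate argument mirrors the three-morpheme view: the same root and pattern can be paired with a different vocalism without touching the pattern string.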
3 Often in the literature, radical position is indicated with C with no position distinction. Some researchers make the distinction
using particular letters such as FCL or FML for 123, and KRDS or FMLR for 1234. This pays homage to the long Arabic and
Hebrew grammarian tradition of referring to the radicals using the letters of the root for 'doing-related': ف-ع-ل f-ς-l. So, for
example, the عين ςayn of the root ك-ت-ب k-t-b is ت t.
Form Adjustments
The process of combining morphemes can involve a number of phonological, morphological and
orthographic rules that modify the form of the created word; it is not always a simple interleaving
and concatenation of its morphemic components. These rules complicate the process of analyzing
and generating Arabic words. One example is the feminine morpheme, +ة +ħ (Ta-Marbuta, [lit.
'tied T']), which is turned into +ت +t (also called Ta-Maftuha [lit. 'open T']) when followed by a
possessive clitic: أميرة+هم Âamiyraħu+hum 'princess+their' is realized as أميرتهم Âamiyratuhum 'their
princess'. We refer to the +ت +t form of the morpheme +ة +ħ as its allomorph. Similarly, by analogy
to allophones and phonotactics, we can talk about morphotactics as the contextual conditions that
cause a morpheme to be realized as one of its allomorphs. More examples of such rules are discussed in
Section 4.2.
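As a rough illustration of this allomorphy, the following sketch rewrites a final Ta-Marbuta as Ta-Maftuha before attaching a possessive clitic. It assumes undiacritized Arabic script input, and the function name `attach_possessive` is hypothetical; a real system would need many more rules.

```python
TA_MARBUTA = "\u0629"  # ة
TA_MAFTUHA = "\u062A"  # ت


def attach_possessive(word, clitic):
    """Attach a possessive clitic, switching a final Ta-Marbuta to its +t allomorph."""
    if word.endswith(TA_MARBUTA):
        word = word[:-1] + TA_MAFTUHA
    return word + clitic


# 'princess' + 'their' -> 'their princess'
print(attach_possessive("أميرة", "هم"))
```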
Derivational Morphology
Derivational morphology is concerned with creating new words from other words, a process in which
the core meaning of the word is modified. For example, the Arabic كاتب kAtib 'writer' can be seen as
derived from the verb كتب katab 'to write' the same way the English writer can be seen as a derivation
from write. Derivational morphology usually involves a change in part-of-speech (POS). The derived
variants in Arabic typically come from a set of relatively well-defined lexical relations, e.g., location
(اسم مكان), time (اسم زمان), actor/doer/active participle (اسم فاعل) and actee/object/passive participle
(اسم مفعول) among many others. The derivation of one form from another typically involves a
pattern switch. In the example above, the verb كتب katab has the root ك-ت-ب k-t-b and the pattern
1a2a3; to derive the active participle of the verb, we switch in the pattern 1A2i3 to produce the form
كاتب kAtib 'writer'.
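The pattern switch described above can be sketched as two steps: recover the root by aligning the stem with its known pattern, then fill the new pattern. The helper names below are hypothetical, and the sketch assumes a transliteration in which every letter is one character and the stem has undergone no form adjustments.

```python
def extract_root(stem, pattern):
    """Align a stem with a vocalized pattern (e.g. '1a2a3') and return its radicals."""
    if len(stem) != len(pattern):
        raise ValueError("stem and pattern differ in length")
    radicals = {}
    for letter, symbol in zip(stem, pattern):
        if symbol.isdigit():
            radicals[int(symbol)] = letter
        elif letter != symbol:
            raise ValueError(f"stem {stem!r} does not fit pattern {pattern!r}")
    return [radicals[i] for i in sorted(radicals)]


def apply_pattern(root, pattern):
    """Fill the digit slots of a vocalized pattern with the root radicals."""
    return "".join(root[int(s) - 1] if s.isdigit() else s for s in pattern)


def switch_pattern(stem, old_pattern, new_pattern):
    return apply_pattern(extract_root(stem, old_pattern), new_pattern)


print(switch_pattern("katab", "1a2a3", "1A2i3"))  # verb -> active participle
```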
Although compositional aspects of derivations do exist, the derived meaning is often
idiosyncratic. For example, the masculine noun مكتب maktab 'office/bureau/agency' and the feminine
noun مكتبة maktabaħ 'library/bookstore' are derived from the root ك-ت-ب k-t-b 'writing-related'
with the pattern+vocalism ma12a3, which indicates location. The exact type of the location is thus
idiosyncratic, and it is not clear how the nominal gender difference can account for the semantic
difference.
Inflectional Morphology
On the other hand, in inflectional morphology, the core meaning and POS of the word remain
intact and the extensions are always predictable and limited to a set of possible features. Each
feature has a finite set of associated values. For example, in row (5) column (3) from the left in
Figure 4.2, the feature-value pairs number:plur and case:gen indicate that that particular analysis of
the word وكتبه wakutubihi is plural in number and genitive in case, respectively. Inflectional features
are all obligatory and must have a specific (non-nil) value for every word. Some features have POS
restrictions. In Arabic, there are eight inflectional features. Aspect, mood, person and voice only apply
to verbs, while case and state only apply to nouns/adjectives. Gender and number apply to both verbs
and nouns/adjectives.
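The feature inventory and its POS restrictions can be written down as a small lookup table. The sketch below is illustrative only: the POS labels `"verb"` and `"nominal"`, the function names, and the example values other than number:plur and case:gen are assumptions made here.

```python
# Which parts of speech each of the eight inflectional features applies to.
FEATURE_POS = {
    "aspect": {"verb"},
    "mood": {"verb"},
    "person": {"verb"},
    "voice": {"verb"},
    "case": {"nominal"},
    "state": {"nominal"},
    "gender": {"verb", "nominal"},
    "number": {"verb", "nominal"},
}


def features_for(pos):
    """Return the obligatory inflectional features for a part of speech."""
    return sorted(f for f, allowed in FEATURE_POS.items() if pos in allowed)


def is_complete(pos, analysis):
    """Check that every obligatory feature has a specific (non-nil) value."""
    return all(analysis.get(f) is not None for f in features_for(pos))


print(features_for("verb"))
print(is_complete("nominal", {"number": "plur", "case": "gen"}))  # gender, state missing
```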
Cliticization Morphology
Cliticization is closely related to inflectional morphology. Similar to inflection, cliticization does not
change the core meaning of the word. However, unlike inflectional features, which are all obligatory,
clitics (i.e., clitic features) are all optional. Moreover, while inflectional morphology is expressed
using both templatic and concatenative morphology (i.e., using patterns, vocalisms and affixes),
cliticization is only expressed using concatenative morphology (i.e., using affix-like clitics).
The Lexeme
The core meaning of a word in functional morphology is often referred to using a variety of terms,
such as the lexeme, the lemma or the vocable. These terms are not equal. A lexeme is a lexicographic
abstraction: it is the set of all word forms that share a core meaning and differ only in inflection and
cliticization. For example, the lexeme بيت bayt1 'house' includes بيت bayt 'house', للبيت lilbayti 'for
the house' and بيوت buyuwt 'houses' among others; while the lexeme بيت bayt2 'verse' includes
بيت bayt 'verse', للبيت lilbayti 'for the verse' and أبيات ÂabyAt 'verses' among others. Note that the
singulars in the two lexemes are homonyms⁴ but the plurals are not. This is called partial paradigm
homonymy. Sometimes, two lexemes share the full inflectional paradigm but only differ in their
meaning (full paradigm homonymy). Examples are the lexemes قاعدة qAςidaħ1 'rule' and قاعدة
qAςidaħ2 'base'. A lexeme can be referred to uniquely by supplementing the lemma with an index
(as above), with additional forms that are necessary to distinguish the lexeme (such as the plural
form) and/or with a gloss in another language.
By contrast, the lemma (also called the citation form) is a conventionalized choice of one of
the word forms to stand for the set. For instance, the lemma of a verb is the third person masculine
singular perfective form; while the lemma for a noun is the masculine singular form (or feminine
singular if no masculine is possible). Lemmas typically are without any clitics and without any
sense/meaning indices. For the examples above, the lemmas are بيت bayt and قاعدة qAςidaħ, both
of which collapse/ignore semantic differences and morphological differences. Lexemes are commonly
represented using sense-indexed lemmas (as we saw above).
4 See Section 7.1 for more on homonymy.
The term vocable is a purely morphological characterization of a set of word forms without
semantic distinctions. Words with partial paradigm homonymy are represented with two vocables
(e.g., بيت bayt1 'house' and بيت bayt2 'verse'); however, words with full paradigm homonymy are
represented with one vocable (e.g., قاعدة qAςidaħ 'rule/base').
Terminology Alert The terms for root and stem are sometimes confused with lemma, lexeme and
vocable.
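The three notions can be contrasted by grouping the same toy entries in three ways. The sketch below is a simplification under assumptions: the entry layout, the function names, and the truncated paradigms (only the forms mentioned above, and a single form for qAςidaħ) are made up for illustration.

```python
# (lemma, sense index, gloss, word forms in the paradigm) -- toy data only.
ENTRIES = [
    ("bayt", 1, "house", frozenset({"bayt", "lilbayti", "buyuwt"})),
    ("bayt", 2, "verse", frozenset({"bayt", "lilbayti", "ÂabyAt"})),
    ("qAςidaħ", 1, "rule", frozenset({"qAςidaħ"})),
    ("qAςidaħ", 2, "base", frozenset({"qAςidaħ"})),
]


def lexemes(entries):
    """Lexemes are sense-indexed lemmas: one per distinct meaning."""
    return {(lemma, index) for lemma, index, _, _ in entries}


def lemmas(entries):
    """Lemmas ignore both semantic and paradigm differences."""
    return {lemma for lemma, *_ in entries}


def vocables(entries):
    """Vocables distinguish paradigms but not meanings."""
    return {(lemma, forms) for lemma, _, _, forms in entries}


print(len(lexemes(ENTRIES)), len(lemmas(ENTRIES)), len(vocables(ENTRIES)))
```

With this data, the partial paradigm homonyms (bayt) yield two vocables, while the full paradigm homonyms (qAςidaħ) collapse into one.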
Figure 4.3: Arabic proclitics. The most important clitics are presented with their order class, POS tag,
function and English gloss.
Clitics are usually attached to the word they are adjacent to; however, there are exceptions
that allow them to be separated. For example, the determiner ال+ Al+ and prepositional proclitics
can appear unattached when the word is quoted or written in a foreign script, e.g., لـ"اعترافاته"
li "AiςtirAfAtihi" 'for his-confessions' or الـ iPod Al iPod 'the iPod'. The clitic in these cases is
usually followed by the Tatweel/Kashida symbol to maintain a connected letter shape (although not
necessarily: ال iPod Al iPod 'the iPod').
The prepositional proclitics can have pronominal objects, which may lead to what appears to
be a proclitic followed by an enclitic and no word base, e.g., لهم la+hum 'for them'. In these cases,
the prepositions are typically considered the base word.
Figure 4.4: Pronominal enclitics. All pronominal enclitics appear in cliticization class PRO. The PATB
POS tag form is derived from the specific person-gender-number information of the pronoun; e.g., the
tag of the 2nd person masculine plural possessive pronoun +كم +kum is POSS_PRON_2MP.

PATB POS Tags
  1  POSS_PRON_PerGenNum (nominal possessive)
  2  [PIC]VSUFF_DO:PerGenNum (verbal direct object)
  3  PRON_PerGenNum (prepositional object)

Person  Gender  Singular                         Dual           Plural
1st     -       +ي +iy (1,3); +ني +niy (2)       +نا +nA        +نا +nA
2nd     Masc    +ك +ka                           +كما +kumA     +كم +kum
2nd     Fem     +ك +ki                           +كما +kumA     +كن +kun~a
3rd     Masc    +ه +hu                           +هما +humA     +هم +hum
3rd     Fem     +ها +hA                          +هما +humA     +هن +hun~a

Other Clitics In addition to the clitics described above, there are a few other clitics that are less
frequent or more lexically constrained than the clitics discussed so far. These clitics are nonetheless
decliticized in the Arabic treebanks (see Chapter 6), and as such they should not be simply ignored.
The following is a brief description of these clitics.
The negative particles ما mA and لا lA (PATB POS tag NEG_PART) and the vocative
particle يا yA (PATB POS tag VOC_PART) are sometimes treated like proclitics although
their cliticization is actually a very common spelling error resulting from these particles ending
with a disconnective letter. For example, the quoted sequence لا يزال lA yzAl 'continue [lit. not
cease]' is roughly six times more frequent on the web than its cliticized version لايزال lA+yzAl.⁵
The special preposition ت+ ta+, often called تاء القسم 'Ta of Oath', is almost never used outside
the phrase تالله taAllahi 'by God'.
The definite article ال+ Al+ has a homonym that acts almost exactly the same way in terms of
cliticization but is a relative pronoun.
The word ما mA, which can be an interrogative pronoun 'what', a relative pronoun 'which, that'
or a subordinating conjunction 'that' (PATB POS tags INTERROG_PRON, REL_PRON
and SUB_CONJ, respectively), can be cliticized to a closed class of words such as عندما
ςindamA 'when [lit. at-which]' and بينما baynamA 'while [lit. between-which]'. In two cases,
the attaching word experiences a change of spelling resulting from assimilation: مما mim~A
'from which' is actually من+ما min+mA; and عما ςam~A 'about which' is actually عن+ما ςan+mA.
In these two cases and only these two cases, the enclitic +ما +mA is sometimes reduced to +م
+ma: مم mim~a.
5 لا يزال lA yzAl got 763,000 hits as opposed to 132,000 hits for لايزال lA+yzAl on Google [July 15, 2009].
The word من man, which can be an interrogative or relative pronoun 'who, whom' (PATB POS
tags INTERROG_PRON, REL_PRON), can be cliticized to the prepositions من min 'from'
and عن ςan 'about'. The cliticization leads to a change of spelling similar to the case described
above with ما mA: ممن mim~an 'from whom' is actually من+من min+man; and عمن ςam~an
'about whom' is actually عن+من ςan+man.
The negative particle لا lA (PATB POS tag NEG_PART) appears as an enclitic to the
subordinating conjunction أن Âan, which experiences a spelling change resulting from assimilation:
ألا Âal~A 'that not' is actually أن+لا Âan+lA. When the prepositional proclitic ل+ li+ attaches
to this word, it experiences yet another spelling change: لئلا liŷal~A 'so that not' is actually
ل+أن+لا li+Âan+lA.
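Because these spelling changes apply to a closed set of host-enclitic pairs, one simple way to model them is a lookup table with plain concatenation as the fallback. The sketch below works on undiacritized Arabic script; the table and function names are hypothetical, and the table lists only the cases named above.

```python
# Closed set of host+enclitic pairs whose spelling changes through assimilation.
ASSIMILATED = {
    ("من", "ما"): "مما",   # min+mA
    ("عن", "ما"): "عما",   # ςan+mA
    ("من", "من"): "ممن",   # min+man
    ("عن", "من"): "عمن",   # ςan+man
    ("أن", "لا"): "ألا",   # Âan+lA
}


def attach_enclitic(host, enclitic):
    """Join host and enclitic, using the assimilated spelling where one is listed."""
    return ASSIMILATED.get((host, enclitic), host + enclitic)


def attach_li(word):
    """Attach the proclitic li+; only the lexical case described above is special."""
    if word == "ألا":
        return "لئلا"
    return "ل" + word


print(attach_enclitic("من", "ما"))
print(attach_li(attach_enclitic("أن", "لا")))
```

A decliticizer would run the same table in reverse, which is why these forms need explicit handling rather than simple prefix or suffix stripping.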
Arabic dialects introduce several additional clitics, some with new functionality that does not
exist in MSA. For example, a verbal progressive particle, which has no correspondence in
MSA, appears as ب+ bi+ in Egyptian and Levantine Arabic, as د+ da+ in Iraqi Arabic and
ك+ ka+ in Moroccan Arabic. The Egyptian and Levantine Arabic progressive clitic ب+ bi+
is a homograph with the preposition bi+ 'in/with', which is also present in these dialects. The
MSA future proclitic س+ sa+ is replaced by ح+ Ha+ in Levantine and Egyptian (appearing
also as هـ+ ha+ occasionally in Egyptian) and غ+ γa+ in Moroccan. Levantine, Iraqi and Gulf
Arabic have a demonstrative proclitic هـ+ ha+, which strictly precedes the definite article
ال+ Al+. Several dialects include the proclitic ع+ ςa+, a reduced form of the preposition على
ςalaý 'on/upon/about/to'. Also, several dialects include the non-MSA circum-clitic ما+ +ش mA+
+š, which is used to mark negation. Iraqi Arabic has a contraction of the dialectal question
word شنو šinuw 'what' that appears as ش+ š+. For more information on Arabic dialects, see
[72, 73, 74, 75].
Verbal Morphology
Arabic verbal morphology is often described as a very regular, almost mathematical, system with
hardly any exceptions. Verbs inflect for aspect, mood, voice and subject (person, gender and number).
Verbal Forms Arabic verbs have a limited number of patterns: ten basic triliteral patterns and two
basic quadriliteral patterns. The very few additional rare patterns are not discussed here. In western
tradition, verbal patterns are also called Forms (and given a Roman numeral). Figure 4.5 lists the
different basic verbal patterns and their general meaning associations. As mentioned earlier, pattern
meaning is mostly idiosyncratic although observations have been made on shared semantics [38].
Form I (triliteral pattern 1V2V3) and Form QI (quadriliteral pattern 1V23V4) are considered the
most basic patterns (مجرد mujar~ad) as opposed to other patterns which are described as augmented
[38] or derived [76] (مزيد maziyd).
Figure 4.5: Arabic verb forms. Patterns for perfective (PV) and imperfective (IV) aspect are provided in
the active and passive voice. Passive voice patterns are in parentheses. All patterns and examples are
conjugated in the 3rd person masculine singular. Form I has six subtypes that vary in the
perfective/imperfective stem vowel (marked as Vp and Vi, respectively); however, Form I has only one
passive voice form per aspect (regardless of subtype).

Form     PV-Pattern   IV-Pattern    Meaning                  Example                      Gloss
I-VpVi   1a2Vp3       a12Vi3        Basic sense of root      -                            -
         (1u2i3)      (u12a3)
I-aa     1a2a3        a12a3         -                        fataH, y+aftaH               open
I-au     1a2a3        a12u3         -                        katab, y+aktub               write
I-ai     1a2a3        a12i3         -                        jalas, y+ajlis               sit
I-ia     1a2i3        a12a3         -                        γaDib, y+aγDab               be angry
I-ii     1a2i3        a12i3         -                        Hasib, y+aHsib               consider
I-uu     1a2u3        a12u3         -                        Hasun, y+aHsun               be beautiful
II       1a22a3       u1a22i3       Intensification,         kat~ab, y+ukat~ib            dictate
         (1u22i3)     (u1a22a3)     causation
III      1A2a3        u1A2i3        Interaction with         kAtab, y+ukAtib              correspond
         (1uw2i3)     (u1A2a3)
IV       a12a3        u12i3         Causation                Âajlas, y+ujlis              seat
         (u12i3)      (u12a3)
V        ta1a22a3     ata1a22a3     Reflexive of Form II     taςal~am, y+ataςal~am        learn
         (tu1u22i3)   (uta1a22a3)
VI       ta1A2a3      ata1A2a3      Reflexive of Form III    takAtab, y+atakAtab          correspond
         (tu1uw2i3)   (uta1A2a3)
VII      in1a2a3      an1a2i3       Passive of Form I        Ainkatab, y+ankatib          subscribe
         (in1u2i3)    (un1a2a3)
VIII     i1ta2a3      a1ta2i3       Acquiescence,            Aiktatab, y+aktatib          register
         (i1tu2i3)    (u1ta2a3)     exaggeration
IX       i12a3a3      a12a3i3       Transformation           AiHmar~, y+aHmar~            turn red, blush
         (i12u3i3)    (u12a3a3)
X        ista12a3     asta12i3      Requirement              Aistaktab, y+astaktib        make write
         (istu12i3)   (usta12a3)
QI       1a23a4       u1a23i4       Basic sense of root      zaxraf, y+uzaxrif            ornament
         (1u23i4)     (u1a23a4)
QII      ta1a23a4     ata1a23a4     Reflexive or             tazaxraf, y+atazaxraf        be ornamented
         (tu1u23i4)   (uta1a23a4)   unaccusative of QI
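The Form I rows of Figure 4.5 can be generated from the two stem vowels alone. The following sketch is a simplified illustration: the function names are invented here, roots are passed as strings of single-character radicals, and only the 3rd person masculine singular (imperfective prefix y+) is produced.

```python
def form_i_patterns(vp, vi):
    """Build the Form I patterns for perfective stem vowel vp and imperfective vi."""
    return {
        "PV": f"1a2{vp}3",
        "IV": f"a12{vi}3",
        "PV-passive": "1u2i3",  # one passive pattern per aspect,
        "IV-passive": "u12a3",  # regardless of subtype
    }


def fill(pattern, root):
    """Replace the digit slots of a pattern with the root radicals."""
    return "".join(root[int(c) - 1] if c.isdigit() else c for c in pattern)


def conjugate_form_i(root, subtype):
    """Return (perfective, imperfective) for a subtype label such as 'au'."""
    vp, vi = subtype
    patterns = form_i_patterns(vp, vi)
    return fill(patterns["PV"], root), "y+" + fill(patterns["IV"], root)


print(conjugate_form_i("ktb", "au"))
print(conjugate_form_i("jls", "ai"))
```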
Verbal Subject, Aspect, Mood and Voice The verbal subject is specified using three features: person,
gender and number. Person has three values: 1st (speaker, متكلم mutakal~im), 2nd (addressee, مخاطب
muxATab) and 3rd (other, غائب γAŷib). Gender has two values: masculine or feminine. And number has
three values: singular, dual or plural. The verbal subject is indicated through affixations, whose form
is constrained by verbal aspect and mood. See Figure 4.6 for a list of all the verbal subject affixes.
The subjects conjugated with the perfective aspect are only suffixes, while the subjects conjugated
with the imperfective are circumfixes.
Arabic verbs have three aspects: perfective (ماضي mADiy), imperfective (مضارع muDAriς)
and imperative (أمر Âamr).⁶ The perfective indicates that actions described are completed (perfect)
as opposed to the imperfective which does not specify any such information. The imperative is the
command form of the verb. Aspect is indicated only templatically through pattern and vocalism
combination. In all forms, except for Form I, pattern and vocalism specification are regular. Form I
vocalism has six idiosyncratic variants that share a common pattern. See Figure 4.5. Some verbs can
have more than one variant with no meaning change, e.g., lamas 'touch' can have two imperfective
forms: y+almis and y+almus. In other cases, a change in pattern vowels has different meanings, e.g.,
حسب/يحسب Hasab/y+aHsub 'compute', Hasib/y+aHsib 'regard', and Hasub/y+aHsub 'be esteemed'.
Terminology Alert The perfective / imperfective aspect stems are sometimes called s-
stem (suffixing-stem) / p-stem (prefixing-stem) [38], where p-stem refers to the imperfective, not
the perfective.
Arabic has three common moods that only vary for the imperfective aspect: indicative (مرفوع
marfuwς), subjunctive (منصوب manSuwb), and jussive (مجزوم majzuwm). One additional archaic
mood is called the energetic. The perfective aspect does not vary by mood, although the value of
the mood feature with the perfective aspect is typically defaulted to indicative. The indicative mood
is also the default mood with the imperfective aspect indicating present or incomplete actions. The
other moods are restricted in their use with particular verb particles. The subjunctive is used after
conjunction particles such as كي kay 'in order to' and the future-negation particle لن lan 'will not'.
The jussive most commonly appears after the past-negation particle لم lam 'did not'. There are no
specific morphemes in Arabic corresponding to tense, such as past, present or future. Instead, these
various tense values are expressed through a combination of aspect, mood and supporting particles.
For instance, in addition to the two temporally marking particles exemplified above, the particle
سوف sawfa 'will', and its clitic form س+ sa+ are used to indicate the future tense by appearing with
indicative imperfective verbs.
Voice can be passive or active. It is only indicated through a change in vocalism. See Figure 4.5
for examples of aspect and voice templatic morpheme combinations.
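The particle-mood pairings mentioned above can be summarized in a small table. This is a sketch covering only the particles named in the text; the dictionary layout and the function name `mood_after` are assumptions for illustration, and the default reflects the statement that the indicative is the default mood of the imperfective.

```python
# particle -> (mood required on the following imperfective verb, English gloss)
PARTICLES = {
    "kay": ("subjunctive", "in order to"),
    "lan": ("subjunctive", "will not"),   # future negation
    "lam": ("jussive", "did not"),        # past negation
    "sawfa": ("indicative", "will"),      # future
    "sa+": ("indicative", "will"),        # clitic form of sawfa
}


def mood_after(particle=None):
    """Return the mood of an imperfective verb after the given particle.

    With no particle, the default indicative mood is returned. Unknown
    particles raise KeyError, since this table is deliberately incomplete.
    """
    if particle is None:
        return "indicative"
    return PARTICLES[particle][0]


print(mood_after("lam"), mood_after("lan"), mood_after())
```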
Nominal Morphology
In this section, we discuss the morphology of nouns, adjectives and proper nouns, henceforth,
collectively nominals. In comparison to verbs, nominal morphology is far more complex and
idiosyncratic. Arabic nominals inflect for gender, number, state and case. Figure 4.7 presents the
different affixational morphemes associated with different combinations of these features.
6 Some consider the imperative a mood rather than an aspect [77].
Gender and Number In functional morphology terms, Arabic has two gender values: masculine
and feminine; and three number values: singular, dual and plural. However, in terms of their form, the
story is more complex. First, these two features are commonly expressed using shared morphemes
that represent some number and some gender combination. This is not atypical compared to other
languages. Second, there is a disconnect between the markers of morphemic and functional gender
and number. In around 80% of nominals⁷, functional and morphemic genders agree, e.g., مدرس
mudar~is 'teacher [m.s.]', مدرسة mudar~is+aħ 'teacher [f.s.]', مدرسون mudar~is+uwna 'teachers
[m.p.]', and مدرسات mudar~is+At 'teachers [f.p.]'. Plurals that agree functionally and morphemically
are called sound plurals (الجمع السالم). However, in the other 20%, functional gender and number do
not match morphemic gender and number. The following are some of the most common patterns
of form-function disagreement.
Broken Plural The most common case of function-form disagreement is broken (irregular)
plural (جمع التكسير) where functional plurals are expressed using a pattern change with singular
affixation. For example, the plural of مكتب maktab 'office' is مكاتب makAtib 'offices' (not
*مكتبون *maktabuwn). Broken plurals make up around half of all plurals in Arabic. See
Figure 4.8 for a list of common singular-plural pattern pairs. The pairing of singulars and plurals
is basically idiosyncratic, but there are some very common pairs. Some of the plural patterns
appear with singulars also: the words كتاب kitAb 'book' (singular) and رجال rijAl 'men' (broken
plural) share the pattern 1i2A3; and similarly, the words كتب kutub 'books' (broken plural) and
عنق ςunuq 'neck' (singular) share the pattern 1u2u3. Finally, some nouns have multiple plurals,
some with subtle distinctions that do not exist any more in MSA.⁸ The multiple plurals can
be all broken or a mixture of broken and sound.
7 Analyzed in a sample from the Penn Arabic Treebank [9].
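The observation that a pattern such as 1i2A3 hosts both singulars and broken plurals can be checked mechanically by compiling a pattern into a regular expression. The sketch below rests on simplifying assumptions: transliterated input with one character per letter, a vowel inventory of a, i, u and A, and function names invented for this example.

```python
import re


def pattern_to_regex(pattern):
    """Compile a vocalized pattern like '1i2A3' into a regular expression.

    Each new digit becomes a capture group for one non-vowel letter; a
    repeated digit (as in '1a22a3') becomes a backreference to that group.
    """
    parts, groups = [], {}
    for symbol in pattern:
        if symbol.isdigit():
            if symbol in groups:
                parts.append(f"\\{groups[symbol]}")
            else:
                groups[symbol] = len(groups) + 1
                parts.append("([^aiuA])")
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))


def fits(word, pattern):
    return pattern_to_regex(pattern).fullmatch(word) is not None


# A singular and a broken plural both fit 1i2A3, so the pattern alone
# does not determine functional number.
print(fits("kitAb", "1i2A3"), fits("rijAl", "1i2A3"))
```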
Broken Feminine The most common way for deriving the feminine form of a masculine noun
or adjective is using the feminine suffix +ة +ħ. However, there are three stable masculine-
feminine pattern pairs that we call Broken Feminine:⁹
Color/Deformity Adjective: a12a3-1a23A: أزرق Âazraq 'blue [m.]', زرقاء zarqA' 'blue [f.]'
Superlatives: a12a3-1u23aY: أكبر Âakbar 'greatest [m.]', كبرى kubraý 'greatest [f.]'
Other: 1a23An-1a23aY: سكران sakrAn 'drunk [m.]', سكرى sakraý 'drunk [f.]'
Basic Gender Mismatch Some of the nouns, particularly those that do not vary for gender (i.e.,
inherently masculine or feminine), have inconsistent morphemic morphology. The following
are some common examples: عين ςayn 'eye' and حامل HAmil 'pregnant' are masculine by
form but feminine by function; and خليفة xaliyfaħ 'caliph' is feminine by form but masculine
by function. A few of these nouns can be both feminine and masculine functionally, e.g.,
طريق Tariyq 'road'. In other cases, the singular form may be correctly masculine, but it takes a
feminine plural suffix (although it remains functionally masculine): تهديد tahdiyd 'threat [m.s.]'
and باص bAS 'bus [m.s.]' have the plurals تهديدات tahdiydAt 'threats [m.p.]' and باصات bASAt
'buses [m.p.]', respectively.
Singular Collective Plurals It is important to also distinguish cases of Arabic nouns that
semantically express plurality but are for all intents and purposes singular as far as Arabic
morphology and syntax are concerned. The most common form of these nouns is collective
nouns (اسم الجنس), which often are the uncountable form of some countable nouns. For
example, the word تمر tamr 'dates' is a singular uncountable collective noun in Arabic, which
cannot be modified with a number quantifier unlike the countable singular form تمرة tamraħ
'date' and its countable plural تمرات tamarAt.¹⁰ The collective تمر tamr 'dates' has its own plural
form تمور tumuwr 'types of dates'.
Complex Disagreement The disagreement in form and function can involve both gender and
number. For example, the masculine plural كتبة katabaħ 'scribes' is morphemically feminine
singular; and the feminine plural حوامل HawAmil 'pregnant women' is morphemically singular
and masculine.
8 A good example is the distinction of plural of paucity (جمع القلة) and plural of plenty/abundance (جمع الكثرة), which we do not
discuss here [78].
9 A few feminine nouns have plurals that are different in pattern yet also use sound plural affixation. We call these semi-sound
nouns: the plural of تمرة tamr+aħ 'palm date' is تمرات tamar+At 'palm dates' (not *tamr+At).
10 The countable counterpart of a collective noun is called its unit noun (اسم الوحدة). We consider this a derivational relationship.
Definiteness and Construct State Arabic nouns inflect for state, which has three values: definite,
indefinite and construct. The definite state is the nominal form that appears most typically with the
definite article and direct addressing with the vocative particle يا yA. For example, الكتابُ Al+kitAb+u
'the book'. The indefinite state is used to mark an unspecified instance of a noun, e.g., كتابٌ kitAbũ
'a book'. The construct state indicates that the noun is the head of an Idafa construction, i.e., it is the
first word (or مضاف muDAf) that is possessed by the noun phrase that follows it, e.g., the word كتابُ
kitAbu 'book' in كتابُ الطالبِ kitAbu AlTAlibi [lit. book the-student] 'the book of the student'. For
some nouns, like كتاب kitAb, the definite and construct state forms are identical. However, this is
not true for all noun-affix combinations. Sound masculine plural nouns have identical definite and
indefinite state forms but different construct state: الكاتبون Al+kAtib+uwna 'the writers', and كاتبون
kAtib+uwna 'some writers', but كاتبو kAtib+uw 'writers of ...'. See Figure 4.7 for more such cases.
Case All Arabic nominals inflect for case, which has three values in Arabic: nominative (Nom
مرفوع marfuwς), accusative (Acc منصوب manSuwb) or genitive (Gen مجرور majruwr). The
realization of nominal case in Arabic is complicated by its orthography, which uses optional diacritics
to indicate short vowel case morphemes, and by its morphology, which does not always distinguish
between all cases. Additionally, case realization in Arabic interacts heavily with the realization of
state, leading to different realizations depending on whether the nominal is indefinite, i.e., receiving
nunation (تنوين tanwiyn), definite or in construct state. See Figure 4.7.
Eight different classes of nominal case expression have been described in the literature [77, 79].
We briefly review them here:
We first discuss the realization of case in morphemically (form-wise) singular nouns (including
broken plurals). Triptotes are the basic class which expresses the three cases in the singular using the
three short vowels of Arabic: Nom is ـُ +u, Acc is ـَ +a, and Gen is ـِ +i. The corresponding nunated
forms for these three diacritics are: ـٌ +ũ for Nom, ـً +ã for Acc, and ـٍ +ĩ for Gen. Nominals not
ending with Ta-Marbuta (ة ħ) or Alif Hamza (اء A') receive an extra Alif in the accusative indefinite
case (e.g., كتاباً kitAbAã 'book' versus كتابةً kitAbaħã 'writing').
Diptotes are like triptotes except that when they are indefinite: (a.) they do not express nunation
and (b.) they use the ـَ +a suffix for both Acc and Gen. The class of diptotes is lexically specific. It
includes nominals with specific meanings or morphological patterns (colors, elatives, specific broken
plurals, some proper names with Ta Marbuta ending or location names devoid of the definite article).
Examples include بيروت bayruwt 'Beirut' and أزرق Âazraq 'blue'.
The next three classes are less common. The invariables show no case in the singular (e.g.,
nominals ending in long vowels: سوريا suwryA 'Syria' or ذكرى ðikraý 'memoir'). The indeclinables
always use the ـَ +a suffix to express case in the singular and allow for nunation (معنى maςnaý
'meaning'). The defective nominals, which are derived from roots with a final weak radical (y or w),
look like triptotes except that they collapse Nom and Gen into the Gen form, which also includes
losing their final glide: قاضٍ qADĩ (Nom, Gen) versus قاضياً qADiyAã (Acc) 'a judge'.
For the dual and sound plurals, the situation is simpler, as there are no lexical exceptions. The
dual and masculine sound plural (the sixth and seventh classes) express number, case and state jointly
in morphemes that are identifiable even if undiacritized: كَاتِبُونَ kAtib+uwna 'writers [m.p.]' (Nom),
كَاتِبَانِ kAtib+Ani 'writers [m.d.]' (Nom), كَاتِبَتَانِ kAtib+atAni 'writers [f.d.]' (Nom). The dual and
masculine sound plural do not express nunation. On the other hand, the feminine sound plural (the
eighth class) marks nunation explicitly, and all of its case morphemes are written only as diacritics,
e.g., كَاتِبَاتٌ kAtib+At+ũ 'writers [f.p.]' (Nom). For all duals and sound plurals, the Acc and Gen
forms are identical, e.g., كَاتِبِينَ kAtib+iyna 'writers [m.p.]' (Acc, Gen) and كَاتِبَاتٍ kAtib+At+ĩ 'writers
[f.p.]' (Acc, Gen) (see Figure 4.7).
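The triptote and diptote rules for morphemically singular nominals can be sketched as a small function. This covers only those two classes; the function name, the feature labels, and the transliteration symbols ũ, ã, ĩ for the nunated endings are assumptions of this sketch, and the extra Alif of the accusative indefinite is not modeled.

```python
SHORT = {"nom": "u", "acc": "a", "gen": "i"}
NUNATED = {"nom": "ũ", "acc": "ã", "gen": "ĩ"}


def case_suffix(case, indefinite, diptote=False):
    """Return the case suffix of a morphemically singular triptote or diptote.

    Definite and construct nominals take the plain short vowel. Indefinite
    triptotes take the nunated vowel; indefinite diptotes take no nunation
    and use +a for both accusative and genitive.
    """
    if not indefinite:
        return SHORT[case]
    if diptote:
        return "u" if case == "nom" else "a"
    return NUNATED[case]


print("kitAb+" + case_suffix("gen", indefinite=True))
print("bayruwt+" + case_suffix("gen", indefinite=True, diptote=True))
```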
Masdar (مصدر) also called the infinitive or the (de)verbal noun. There are no rules for deriving
the masdar of verbal Form I: نام nAm 'sleep [v.]' → نوم nawm 'sleep [n.]', كتب katab 'write [v.]'
→ كتابة kitAbaħ 'writing [n.]', and دخل daxal 'enter [v.]' → دخول duxuwl 'entry [n.]'. The
rest of the verbal forms are regular providing one pattern for each form, e.g., the masdar of
Form II (1a22a3) is ta12iy3: كسّر kas~ar 'break [v.]' → تكسير taksiyr 'breaking'.
The active participle (اسم الفاعل) and passive participle (اسم المفعول) both have unique
pattern mappings for every verb form. For example, Form I active and passive participles are
1A2i3 and ma12uw3, respectively: كاتب kAtib 'writer' and مكتوب maktuwb 'written'. The
corresponding patterns for Form X are musta12i3 and musta12a3: the participles of the verb
استخدم Aistaxdam 'use' are مستخدم mustaxdim 'user' and مستخدم mustaxdam 'used'.
The patterns ma12a3 and ma12i3 are used to indicate nouns of place and time
(أسماء المكان والزمان), e.g., مكتب maktab 'office' from كتب katab 'write' and مجلس majlis 'council'
from جلس jalas 'sit'.
There are several nominal patterns that denote instruments (اسم الآلة) used for the verb they
are derived from. For example, mi12A3 is used to derive مفتاح miftAH 'key' from فتح fataH
'open' and منشار minšAr 'saw [n.]' from نشر našar 'saw [v.]'. Other patterns include 1a22A3aħ,
e.g., كسّارة kas~Araħ 'nutcracker' from كسّر kas~ar 'smash [v.]'; and 1A2uw3, e.g., حاسوب
HAsuwb 'computer' from حسب Hasab 'compute'. These forms are rather idiosyncratic in their
formation.
The pattern 1u2ay3 among others is used to derive the diminutive form (اسم التصغير) of
another noun, e.g., شجيرة šujayraħ 'shrub' is the diminutive of شجرة šajaraħ 'tree'.
The Ya of Nisba [lit. 'Ya of relatedness'] is a derivational suffix (+يّ +iy~) that maps nouns to
adjectives related to them. It is quite productive compared to other examples of derivational
morphology. Examples include أردن Âurdun 'Jordan' → أردنيّ Âurduniy~ 'Jordanian', سياسة
siyAsaħ 'politics' → سياسيّ siyAsiy~ 'political' and ملك malik 'king' → ملكيّ malakiy~ 'royal'.
The last two examples illustrate how Ya of Nisba derivation can include dropping suffixes
(such as the feminine ending) or changing the pattern and/or vocalism. These interactions are
rather ad hoc.
The countable counterpart of a collective noun is called its unit noun (اسم الوحدة). It is often
derived through a feminine singular suffix, e.g., تمر tamr 'dates' (collective) → تمرة tamraħ 'date'
(singular).
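The regular derivations listed above lend themselves to a table keyed by verb form and lexical relation. The sketch below includes only pattern pairs stated in the text; the table and function names are hypothetical, roots are given as strings of single-character radicals, and the idiosyncratic cases (Form I masdars, instrument nouns, Nisba adjustments) are deliberately left out.

```python
# (verb form, relation) -> vocalized pattern; only pairs named in the text.
DERIVATION_PATTERNS = {
    ("I", "active participle"): "1A2i3",
    ("I", "passive participle"): "ma12uw3",
    ("X", "active participle"): "musta12i3",
    ("X", "passive participle"): "musta12a3",
    ("II", "masdar"): "ta12iy3",
}


def derive(root, form, relation):
    """Fill the pattern registered for (form, relation) with the root radicals."""
    pattern = DERIVATION_PATTERNS[(form, relation)]
    return "".join(root[int(c) - 1] if c.isdigit() else c for c in pattern)


print(derive("ktb", "I", "active participle"))
print(derive("xdm", "X", "passive participle"))
print(derive("ksr", "II", "masdar"))
```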
Root-Pattern-Affix Interactions
Morphophonemic Rules There is a large number of morphophonemic rules. We consider three
common sets.
Form VIII Rules The pattern consonant ت t in verbal Form VIII (افتعل i1ta2a3) changes
to د d when the first root radical is ز z, د d or ذ ð. Similarly, the same pattern consonant changes to
ط T when the first root radical is an emphatic consonant (ص S, ض D, ط T or ظ Ď). For example,
compare the following verbs, all of which are in Form VIII: استلم Astlm 'he received' (root s-l-m,
default case), ازدهر Azdhr 'he flourished' (root z-h-r) and اصطبر ASTbr 'he endured' (root S-b-r).
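The two Form VIII assimilation rules can be expressed as a choice of infix conditioned on the first radical. The sketch below produces the undiacritized consonantal skeleton in transliteration, as in the examples above; the function names are invented for illustration, and the leading A stands for the word-initial Alif.

```python
VOICED_TRIGGERS = {"z", "d", "ð"}     # first radicals that turn t into d
EMPHATICS = {"S", "D", "T", "Ď"}      # first radicals that turn t into T


def form8_infix(first_radical):
    """Pick the realization of the Form VIII pattern consonant t."""
    if first_radical in VOICED_TRIGGERS:
        return "d"
    if first_radical in EMPHATICS:
        return "T"
    return "t"


def form8_skeleton(root):
    """Return the undiacritized Form VIII perfective skeleton for a triliteral root."""
    r1, r2, r3 = root
    return "A" + r1 + form8_infix(r1) + r2 + r3


for root in (("s", "l", "m"), ("z", "h", "r"), ("S", "b", "r")):
    print("-".join(root), form8_skeleton(root))
```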
Weak Radical Rules Weak root radicals (و w, ي y) change into a vowel or are deleted
depending on their vocalic environment. There are several rules with different conditions. The
following are some examples with the root ق-و-ل q-w-l and various patterns and affixes: قال qAla
'he said' (not *qawala), يقول yaquwlu 'he says' (not *yaqwulu), قيل qiyla 'it is said' (not *quwila) and
قلت qultu 'I said' (not *qawaltu). Arabic has a small number of exceptions where a weak radical
behaves like a regular consonant, leading often to two contrastive forms with the same root and
pattern but different word form. One example is the pair استجاب AistajAba 'he complied' and
استجوب Aistajwaba 'he interrogated', both of which are derived from the root ج-و-ب j-w-b and
the pattern ista12a3. The common practice in the field is to handle the second (non-weak) as coming
from a different root that has a hard w.
Geminate Radical Rules Roots with geminate radicals, e.g., م-د-د m-d-d, also interact with
short vowels in the pattern but only under certain suffix conditions. For example, the following two
examples use the same template (1a2a3) but different suffixes: مدّت mad~at /maddat/ 'she extended'
(not *madad+at), but مددت madadtu 'I extended'.¹¹
Orthographic Rules Some of these rules are lexical, meaning that they are conditioned on specific
morphemes or morpheme boundaries; while others are non-lexical referring only to phonemes or
letters. The majority of orthographic rules are non-lexical. The few exceptions include the spelling
of Alif-Maqsura and Ta-Marbuta.
Alif-Maqsura The rule to write an Alif-Maqsura applies when a third radical ي y (defective
root) is turned into a vowel in word-final position: the root ر-م-ي r-m-y and pattern-affix 1a2a3+a
are realized phonologically as /ramā/ (not ramaya), which is realized orthographically as رمى ramaý
'he threw'.
Ta-Marbuta The Ta-Marbuta spelling is also dependent on the affix that follows it. Although
in MSA, the Ta-Marbuta is always pronounced as /t/, it is only written as ة ħ when followed by
an affix devoid of an orthographic letter. For example, مكتبةٌ maktabaħũ /maktabat+un/ 'library' is
spelled with a Ta-Marbuta because the nunation suffix /un/ is written with a diacritic.
The rest of this section presents some of the more common non-lexical orthographic rules
pertaining to diacritization and Hamza spelling.
Diacritization Appropriate modifications to spell with diacritics are applied regardless of
whether the diacritics are kept in the final word form or not. These include (i.) spelling long vowels
as a combination of a diacritical short vowel and a compatible consonant: /ī/ is ي iy and /ū/ is و uw,
(ii.) adding sukuns (no-vowel diacritics) between adjacent consonants, (iii.) adding an Alif
word-initially to words starting with vowel diacritics (the case of Hamzat-Wasl discussed in Chapter 3) and
(iv.) replacing a repeated consonant with a Shadda. The Shadda rule leads to deleting some letters
from stems and affixes. For example, the phonological word /bayyan+na/ 'they [fem] explained' (root
ب-ي-ن b-y-n and pattern-affix 1a22a3+na) is written as بيّنّ bay~an~a. With deleted diacritics,
this eight-phoneme word is written with three letters.
Hamza Spelling The Hamza (glottal stop phoneme) is written using seven orthographic
symbols depending on the Hamza's orthographic and phonological context. Some of the numerous
rules include the following. A word-initial Hamza is written with Alif Hamza below (إ Ǎ) when
followed by /i/ and with Alif Hamza above (أ Â) otherwise. Another common rule is that a Hamza
between two vowels is written using a character compatible with the higher ranking vowel in the
order (high to low) /i/ > /u/ > /a/. For example, the Hamza of سئل suŷila /su'ila/ 'he was asked' is
written with a hamzated Ya because /i/ outranks /u/, as opposed to the Hamza of سؤال suŵAl /su'āl/
'question', which is written with a hamzated Waw because /u/ outranks /a/. For more Hamza spelling
rules and examples, see [77].
¹¹ Arabic dialects share some rules with MSA but not others. For example, the geminate radical example used above, مددت
madadtu 'I extended', is realized in Levantine Arabic as مدّيت /maddēt/. This changed form deletes the stem vowel and adds a long
vowel before the suffix t.
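The vowel-ranking rule for a Hamza between two vowels lends itself to a small sketch. The code below is illustrative only: the function and table names are ours, vowels are passed by quality alone (length is ignored), and the /a/ entry (Alif Hamza above) is an assumption, since the text gives only the /i/ and /u/ cases for this rule. The many other Hamza spelling rules are not modeled.

```python
# Vowel ranking stated in the text: /i/ > /u/ > /a/.
VOWEL_RANK = {"i": 3, "u": 2, "a": 1}

# Hamza letter compatible with each vowel, in this book's transliteration.
# The "a" entry is an assumption made for this sketch.
HAMZA_SEAT = {"i": "ŷ", "u": "ŵ", "a": "Â"}


def intervocalic_hamza(prev_vowel, next_vowel):
    """Pick the Hamza letter for a Hamza standing between two vowels.

    Vowels are given by quality only ('i', 'u' or 'a').
    """
    winner = max((prev_vowel, next_vowel), key=lambda v: VOWEL_RANK[v])
    return HAMZA_SEAT[winner]


print(intervocalic_hamza("u", "i"))  # ŷ, as in suŷila
print(intervocalic_hamza("u", "a"))  # ŵ, as in suŵAl
```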
Clitic-Word Interactions
The inflected form of a word interacts morpho-syntactically with the clitics attached to it. For
example, nouns followed by the possessive pronouns must be in construct state, nouns following
prepositional proclitics must be in the genitive case and verbs following the future proclitic must be
in the imperfective aspect with indicative mood (see Section 4.2.2). That said, most clitics simply
attach to the inflected word with little or no change in spelling or pronunciation. However, there are a
few important exceptions with consequences for the tasks of tokenization, detokenization, diacritization
and POS tagging. The most complicated cases involve pronominal clitics and the definite article.
Pronominal Clitics
The u vowel in the +hu- pronominal enclitics, +ه +hu, +هما +humA, +هم +hum, and +هنّ
+hun~a, undergoes phonological assimilation to i when following a word that ends with i, as
in the nominal genitive case. For example, كتابه 'his book' can be diacritized as kitAbu+hu,
kitAba+hu or kitAbi+hi.
The 1st person singular pronominal enclitic +ي +iy has an allomorph +ya with words ending
with the letters Alif, Ya or Alif-Maqsura: عيناي ςaynAya 'my eyes' (عينا+ي ςaynA+iy), مولاي
mawlAya 'my lord' (مولى+ي mawlaý+iy), فيّ fiy~a 'in me' (في+ي fiy+iy), and عليّ ςalay~a 'on
me' (على+ي ςalaý+iy). Note that in the case of words ending with Ya or with an Alif-Maqsura
that turns into a Ya (last two examples above), the assimilation is orthographically represented
with a Shadda, which means the undiacritized word (with or without the pronominal enclitic)
is not distinguishable in the case of Ya and minimally different in the case of an Alif-Maqsura.
The 1st person singular pronoun +ي +iy overrides the word-final case marker, effectively
normalizing case for such words: كتابي kitAbiy 'my book' can be underlyingly kitAbu+iy, kitAba+iy
or kitAbi+iy (nominative, accusative or genitive, respectively).
When pronominal enclitics other than +ي +iy attach to the preposition ل+ li+, the form of
the preposition is changed to la+. However, this is not the case with the preposition ب+ bi+:
compare لكم la+kum 'for you', لهم la+hum 'for them' (hu-/hi- assimilation averted), بكم bi+kum
'by you' and بهم bi+him 'by them' (with hu-/hi- assimilation).
When followed by a pronominal enclitic, word-final Ta-Marbuta is rewritten as Ta: مكتبة+نا
mktbħ+nA becomes مكتبتنا mktbtnA 'our library'. The resulting word spelling can be ambiguous
with words not originally containing a Ta-Marbuta: كاتبتنا can be كاتبت+نا kAtabat+nA 'she
corresponded with us' or كاتبة+نا kAtibaħu+nA 'our [female] writer'.
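Two of the pronominal-clitic interactions above are simple enough to sketch as string rewrites over the transliteration. This is a rough illustration under our own assumptions, not a full treatment: the function names are hypothetical, `attach_enclitic` works on undiacritized forms and models only the Ta-Marbuta rewrite, and `assimilate_h_enclitic` works on diacritized forms and models only the hu-/hi- assimilation after a final i.

```python
def attach_enclitic(word, enclitic):
    """Attach a pronominal enclitic to an undiacritized transliterated word.

    A word-final Ta-Marbuta (ħ) is rewritten as Ta (t) before the enclitic.
    """
    if word.endswith("ħ"):
        word = word[:-1] + "t"
    return word + enclitic


def assimilate_h_enclitic(word, enclitic):
    """Attach an enclitic to a diacritized word, turning hu- into hi- after i."""
    if word.endswith("i") and enclitic.startswith("hu"):
        enclitic = "hi" + enclitic[2:]
    return word + enclitic


print(attach_enclitic("mktbħ", "nA"))        # mktbtnA
# Two different sources collapse to the same undiacritized spelling:
print(attach_enclitic("kAtbħ", "nA") == attach_enclitic("kAtbt", "nA"))  # True
print(assimilate_h_enclitic("kitAbi", "hu"))  # kitAbihi
print(assimilate_h_enclitic("la", "hum"))     # lahum
print(assimilate_h_enclitic("bi", "hum"))     # bihim
```

The second call shows why detokenized or tokenized output can be ambiguous: once the Ta-Marbuta is rewritten, the original stem ending is no longer recoverable from the spelling alone.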
Definite Article
The Lam of the definite article ال+ Al+ phonologically assimilates if followed by a so-called
Sun letter (see Section 3.2.3). Assimilation is indicated by doubling the first letter of the
word (with a Shadda) and, counterintuitively, not deleting the assimilating letter in the definite
article (to preserve the word's morphemic spelling). No diacritic is provided on the Lam of
the definite article. For example, ال+شمس Al+šamsu 'the sun' is written as الشّمس Alš~amsu;
however, ال+قمر Al+qamaru 'the moon' is written as القمر Alqamaru.
The Alif of the definite article ال+ Al+ is deleted when preceded by the prepositional proclitic
ل+ li+: ل+الكتاب li+AlkitAbi 'for the-book' becomes للكتاب lilkitAbi. A similar case of phonological
elision occurs with the prepositional proclitic ب+ bi+, but without the spelling change:
ب+الكتاب bi+AlkitAbi 'by the-book' remains بالكتاب biAlkitAbi.
The interaction between the definite article and nominals starting with the letter ل l is complex.
The letter ل l is considered a Sun Letter and as such the definite article is not deleted
(although considered silent) and the first letter of the word is geminated: اللّغة All~uγaħ
(ال+لغة Al+luγaħ) 'the language'. However, when the Alif of the definite article is deleted
following the prepositional proclitic ل+ li+, the special status of the silent Lam of the definite
article is revoked (only with Lam-initial nominals!). The result is an ambiguity for the whole
class of Lam-initial nominals of whether the definite article is present or not following the
prepositional proclitic ل+ li+: للغة llγħ can be liluγaħ (ل+لغة li+luγaħ) 'for a language', or
lil~uγaħ (ل+ال+لغة li+Al+luγaħ) 'for the language'.
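The spelling interactions between the prepositional proclitics and the definite article can be summarized in a short sketch. It assumes undiacritized transliterated stems and an explicit flag for definiteness (guessing the article from the spelling would be unreliable); the function name and interface are ours.

```python
def attach_preposition(prep, stem, definite):
    """Spell a prepositional proclitic ("l" or "b") before a nominal stem.

    `stem` is the undiacritized transliterated stem without the article;
    `definite` says whether the definite article Al+ is present.
    """
    if prep == "b":
        # bi+ keeps the Alif of the article in spelling.
        return "b" + ("Al" if definite else "") + stem
    if prep == "l":
        if not definite:
            return "l" + stem
        if stem.startswith("l"):
            # Lam-initial nominal: the article's Lam is dropped as well.
            return "l" + stem
        # Otherwise only the Alif of the article is dropped.
        return "ll" + stem
    raise ValueError(f"unsupported proclitic: {prep!r}")


print(attach_preposition("l", "ktAb", True))   # llktAb
print(attach_preposition("b", "ktAb", True))   # bAlktAb
# Definite and indefinite readings collapse for Lam-initial nominals:
print(attach_preposition("l", "lγħ", True))    # llγħ
print(attach_preposition("l", "lγħ", False))   # llγħ
```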
CHAPTER 5
Computational Morphology Tasks
In this chapter, we discuss a set of common computational morphology tasks and the various
approaches to address them. Most of these tasks, e.g., part-of-speech (POS) tagging or root
extraction, are not an end in themselves. They are support (enabling) technologies that are crucial
for higher order applications such as machine translation (MT), information retrieval (IR) or
automatic speech recognition (ASR). A few serve both roles (an end and a means), in particular
automatic diacritization, which can be seen as a standalone application that allows users to transform
undiacritized text into diacritized text and as a tool to enable text to speech.
In the following section, we define a number of these tasks and relate them to each other. In the
next three sections, we discuss in more detail three sets of tasks: morphological analysis/generation,
tokenization and part-of-speech tagging. In the last section in this chapter, we compare and contrast
in detail two commonly used tools for Arabic processing that handle different subsets of these tasks.
Context: Some tasks are non-contextual (out-of-context) and others are contextual (in-
context). Out-of-context tasks focus on describing the set of possible values (such as POS
tags, diacritizations, lemmas, roots, etc.) associated with a particular word, in general. In con-
trast, in-context tasks focus on selecting the context-appropriate values (again, whether it is a
diacritization, POS tag, lemma, root, etc.). Morphological analysis and morphological disambiguation
are the prototypical out-of-context / in-context tasks, respectively. Every task can be
defined in these two modes. For example, out-of-context tokenization is a task to determine
for a word all the possible tokenizations it can have. The most common form of tokenization
is to select a specific choice in-context. Different computational approaches to tokenization
may (or may not) explicitly or implicitly represent the out-of-context choices internally.
5.2. MORPHOLOGICAL ANALYSIS AND GENERATION 67
Richness: Some tasks differ in being shallow or deep, coarse or fine-grained. For example,
there is a large number of POS tagging sets that can be based on form-based morphemic
morphology (shallow) or functional morphology (deep); they can focus on the core tag of the
main word (coarse) or extend to cover all the values of the inflectional features and clitics.
Similarly, tokenization includes different ways of representing the word tokens including
lemmas, stems, roots or even specific generated word forms; and diacritization can be full or
partial.
Directionality: Some tasks are primarily analytical, i.e., mapping from the surface form to
a deeper form; others are generative, i.e., mapping from a deeper form to a shallower form.
Morphological analysis and generation are the prototypical tasks of these two categories. Most
tasks involving the selection of a subset of the word features, such as the lemma, root, etc.,
are analytical. Normalized tokenization, a task that focuses on producing a naturally occurring
surface form, is partly analytical and partly generative. The word is effectively analyzed to
determine its components, but then a correct form of the tokenized word is generated. For
example, the handling of Ta-Marbuta in words containing a pronominal enclitic necessitates
rewriting the word form once the enclitic is segmented off.
Lexicon and Rules The lexicon and rules are the core knowledge base of any morphological
analysis/generation system. The lexicon typically holds all the specific lexical knowledge that
can include allowed root-pattern combinations (open class information), affixations (closed
class information), word inflectional classes (morphological order and compatibility infor-
mation), and even additional useful information such as entry glosses in another language
(this is not necessary for analysis/generation). Rules, on the other hand, are typically gener-
alizations addressing lexically independent phenomena, such as Ta-Marbuta spelling among
other things. In certain ways, the rules and the lexicon are on a continuum of generality of
morphological information: the lexicon is essentially a long list of very specific rules. What
information is represented in the lexicon versus the rules is completely up to the designers of the
system. Most cases are clear cut decisions, but some can go either way. Obviously, for a system
to function correctly, the lexicon and rules should be in sync. This is why it is often hard (or
not straightforward) to reuse lexicons or rules from one system in another. Rules and lexicon
can be either manually created or automatically/semi-automatically learned. As the knowledge
base of the system, lexicons and rules contrast with the analysis/generation engine that uses
them to accomplish the task. In certain language-specific implementations, this distinction is
lost and the engine may contain hard-coded rules. A very simple morphological analyzer can
consist of nothing but a lexicon that lists all possible word variations and their corresponding
analyses. One example of this is the Egyptian Colloquial Arabic lexicon [52]. Some analyzers
may allow rule-based back-off output that does not appear in their lexicon. For example, the
Buckwalter Arabic Morphological Analyzer [23] typically produces additional proper noun
readings not in its lexicon.
Internal Representation
The internal representation of lexicon and rules varies widely among different systems. For
instance, some approaches use a simple prefix-stem-suffix representation [23, 96] as opposed to
roots, patterns and affixations [86, 80]. Even the use of patterns can vary: morphemic patterns
that require rules to be fully inflected or allomorphic patterns that require more complex lexicon
entries [67]. The internal representation can be to some degree independent of the external
representation. For instance, a lexicon using stems as its internal representation for lookup and
matching can have hard-coded root and pattern information associated with each stem. The
internal representation is more often than not a bound representation that is not valid outside the
confines of the system that uses it. As such, it should not be used under different assumptions
of validity. For example, using a stem lexicon as a dictionary for, say, spelling correction is an
invalid use of this resource, as many of its entries are partial spellings of words that combine
inside the analysis/generation system that uses the lexicon.
Engine A variety of frameworks and programming languages have been used for analysis and
generation, with various degrees of sophistication, robustness and efficiency. Some of the more
complex solutions, such as using finite state machinery [86, 80], gain elegance and reversibility
at the cost of lower speed and larger model sizes compared with simpler code-based solutions [23, 96, 67].
Directionality Some systems are focused on analysis only or generation only as opposed to
both. Certain techniques are inherently reversible such as finite state machinery, but others
are not, such as code-based solutions. If the analysis target representation is very shallow,
generation may not be hard or meaningful.
Extensibility Different approaches to morphology vary in how easily extensible they are.
The more hard-coded and merged the rules and lexicon are, the more complex the extension
process. One particular challenge is extending systems for MSA to handle Arabic dialects,
which require updates to both rules and lexicons.
Performance and Usability There are numerous dimensions for judging performance and
usability. Coverage, in terms of both lexical coverage and coverage of morphological phenom-
ena, is an important metric. Both analysis and generation systems should only output correct
analyses and realizations (generated forms), respectively, and nothing but those analyses and
realizations. Lower precision¹ or lower recall² in the output is not desirable. Another aspect of
performance is robustness to incorrect/misspelled input. Some of the better analyzers propose
alternative corrections as part of the analysis. This is rather necessary for handling cases of
common spelling errors, such as mis-hamzated Alif forms. Finally, the question of usability is
really dependent on the application the analyzer/generator is used in. For some applications,
a system with lower coverage but appropriate depth in external output is far more desirable
than a system that has high coverage but shallow or inappropriate output.
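Precision and recall, as used in this context (correct outputs over all outputs produced, and correct outputs over all outputs in the evaluation reference), can be computed with simple set operations. The sketch below assumes that analyses or realizations are represented as hashable values such as strings; the function name and the sample values are ours.

```python
def precision_recall(produced, reference):
    """Return (precision, recall) for a set of produced analyses/realizations.

    precision = correct / all produced; recall = correct / all in reference.
    Empty denominators yield 0.0 in this sketch.
    """
    produced, reference = set(produced), set(reference)
    correct = produced & reference
    precision = len(correct) / len(produced) if produced else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical analysis identifiers for a single word.
system_output = {"analysis_1", "analysis_2", "analysis_3"}
gold_reference = {"analysis_1", "analysis_2", "analysis_4", "analysis_5"}
print(precision_recall(system_output, gold_reference))
```

Here two of the three produced analyses are correct, and two of the four reference analyses are found.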
Lexicon An Arabic word is viewed as a concatenation of three regions, a prefix region, a stem
region and a suffix region. The prefix and suffix regions can be null. Prefix and suffix lexicon entries
cover all possible concatenations of Arabic prefixes and suffixes, respectively. For every lexicon entry,
a morphological compatibility category, an English gloss and occasional part-of-speech (POS) data
are specified. Stem lexicon entries are clustered around their specific lexeme, which is not used in
the analysis process. Figure 5.1 shows sample entries:³ the first six in the left column are prefixes; the
rest in that column are suffixes; the right column contains seven stems belonging to three lexemes.
The stem entries also include English glosses, which allow the lexicon to function as a dictionary.
However, the presence of inflected forms, such as passives and plurals among these glosses makes
them less usable as English lexemic translations.
Compatibility Tables Compatibility tables specify which morphological categories are allowed
to co-occur. For example, the morphological category for the prefix conjunction و/wa wa+ 'and',
Pref-Wa, is compatible with all noun stem categories and perfect verb stem categories. However,
Pref-Wa is not compatible with imperfective verb stems because Bama imperfective prefixes
must contain a subject prefix morpheme. Similarly, the stem كتاب/kitAb kitAb of the lexeme
كتاب_1/kitAb_1 kitAb 'book' has the category (Ndu), which is not compatible with the category of
the feminine marker ة/ap aħ: NSuff-ap. The same stem, كتاب/kitAb kitAb, appears as one of the
stems of the lexeme كتابة_1/kitAbap_1 kitAbaħ 'writing' with a category that requires a suffix with
the feminine marker. Cases such as these are quite common and pose a challenge to the use of stems
as tokens since they can add unnecessary ambiguity.
¹ In this context, the precision of a particular system is defined as the number of correct analyses/realizations produced by the
system divided by the number of all analyses/realizations it produced.
² In this context, the recall of a particular system is defined as the number of correct analyses/realizations produced by the system
divided by the number of all analyses/realizations in the evaluation reference.
³ The Buckwalter transliteration is preserved in examples of Buckwalter lexicon entries (see Chapter 2).
Analysis Engine The analysis algorithm is rather simple since all of the hard decisions are coded
in the lexicon and the compatibility tables: Arabic words are segmented into all possible sets of
prefix, stem and suffix strings. In a valid segmentation, the three strings exist in the lexicon and are
three-way compatible (prefix-stem, stem-suffix and prefix-suffix). Bama produces multiple analyses
that are tuples of full diacritization, lemma, and morpheme analysis and morpheme tags (also called
the Buckwalter POS tag; see Figure 5.4). For example, the word للكتب llktb 'for the books' would
return an analysis specifying its diacritization as lilkutubi, its lemma as kitAb_1, and its morpheme
analysis and tags as li/PREP+Al/DET+kutub/NOUN+i/CASE_DEF_GEN.
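The segment-and-check procedure described above can be sketched as follows. This is a toy illustration of the general technique, not Bama itself: the lexicon entries, category names, compatibility tables, the verb lemma label and the length limits are all invented for the example, and the case-ending suffix from the analysis above is omitted for brevity.

```python
from itertools import product

# Toy lexicons (hypothetical): surface string -> list of entries.
PREFIXES = {
    "": [("Pref-0", "")],
    "l": [("NPref-Li", "li/PREP")],
    "ll": [("NPref-LiAl", "li/PREP+Al/DET")],
}
STEMS = {
    "ktb": [("N", "kutub/NOUN", "kitAb_1"),
            ("PV", "katab/VERB_PERFECT", "katab_1")],
}
SUFFIXES = {"": [("Suff-0", "")]}

# Toy compatibility tables: allowed category pairs.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-0", "PV"),
               ("NPref-Li", "N"), ("NPref-LiAl", "N")}
STEM_SUFFIX = {("N", "Suff-0"), ("PV", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("NPref-Li", "Suff-0"),
                 ("NPref-LiAl", "Suff-0")}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    for i in range(min(max_prefix, len(word)) + 1):
        for j in range(i + 1, len(word) + 1):
            if len(word) - j <= max_suffix:
                yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return (lemma, morpheme tags) for every valid, compatible segmentation."""
    analyses = []
    for pre, stem, suf in segmentations(word):
        candidates = product(PREFIXES.get(pre, []), STEMS.get(stem, []),
                             SUFFIXES.get(suf, []))
        for (pcat, ptag), (scat, stag, lemma), (xcat, xtag) in candidates:
            if ((pcat, scat) in PREFIX_STEM and (scat, xcat) in STEM_SUFFIX
                    and (pcat, xcat) in PREFIX_SUFFIX):
                tags = "+".join(t for t in (ptag, stag, xtag) if t)
                analyses.append((lemma, tags))
    return analyses


print(analyze("llktb"))  # [('kitAb_1', 'li/PREP+Al/DET+kutub/NOUN')]
print(analyze("ktb"))    # two readings: noun and perfect verb
```

Note that all the linguistic knowledge sits in the tables; the engine itself only enumerates splits and checks the three pairwise compatibilities.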
There are currently three versions of Bama: Bama 1.0/1.2 are both publicly available. Bama
2.0 and Sama 3.0/3.1 (Standard Arabic Morphological Analyzer, essentially Bama 3.0/3.1) are
available through the LDC. See Appendix D for links to these resources.
Analysis Analysis in Almorgeana is similar to Bama: the word is segmented into prefix-stem-suffix
triples, whose individual presence and bilateral compatibility are checked against the Bama
databases. The difference lies in an extra step that uses lexeme and feature keys associated with stem,
prefix and suffix string sequences to construct the lexeme and feature output. For example, the word
للكتب llktb 'for the books' returns the following analysis:⁶
(5.1) lilkutubi=[kitAb_1 POS:N l+ Al+ +PL +GEN]=books
Here, lilkutubi is the diacritized form of the word. Inside the square brackets, we find the nominal
lexeme kitAb_1 'book', the proclitic preposition l+ 'to/for', the definite article Al+ 'the', the
feature +PL 'plural' and the feature +GEN 'genitive case'. Most of the information in the feature
set is directly derivable from the morpheme tags in the Bama output for the same word:
li/PREP+Al/DET+kutub/NOUN+i/CASE_DEF_GEN. However, the feature +PL indicating plurality is not.
It is part of the extension done in Almorgeana in processing the Bama databases.
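For downstream use, an analysis in the form shown in (5.1) can be split into its parts. The parser below is a sketch that assumes exactly the layout of that one example (diacritized form, bracketed lexeme and features, gloss); the function name and the dictionary keys are ours, and actual tool output may differ in ways this does not handle.

```python
import re

# Layout assumed from example (5.1): diac=[lexeme POS:X clitics features]=gloss
ANALYSIS_RE = re.compile(r"(?P<diac>[^=\s]+)=\[(?P<body>[^\]]*)\]=(?P<gloss>.*)")


def parse_analysis(line):
    """Split a lexeme-and-feature analysis string into a dictionary."""
    match = ANALYSIS_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized analysis: {line!r}")
    lexeme, *rest = match.group("body").split()
    parsed = {
        "diac": match.group("diac"),
        "lexeme": lexeme,
        "pos": None,
        "proclitics": [],
        "features": [],
        "gloss": match.group("gloss"),
    }
    for token in rest:
        if token.startswith("POS:"):
            parsed["pos"] = token[len("POS:"):]
        elif token.endswith("+"):      # proclitics are written as l+, Al+
            parsed["proclitics"].append(token)
        elif token.startswith("+"):    # features are written as +PL, +GEN
            parsed["features"].append(token)
    return parsed


print(parse_analysis("lilkutubi=[kitAb_1 POS:N l+ Al+ +PL +GEN]=books"))
```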
Generation In generation, the input is a lexeme and feature set. The generated output is a fully
inflected and diacritized word. For example, [kitAb_1 POS:N l+ Al+ +PL +GEN] generates lilkutubi.
The process of generating from lexeme and features is similar to analysis except that lexeme and
feature keys are used instead of string sequences. First, the feature set is expanded to include all forms
of under-specified obligatory features, such as case, gender, number, etc. Next, all lexeme and feature
keys in the Almorgeana lexicon that fully match any subset of the lexeme and expanded feature
set are selected. All combinations of keys that completely cover the lexeme and expanded feature
set are matched up in prefix-stem-suffix triples. Then, each key is converted to its corresponding
prefix, stem or suffix string. The same compatibility tables used in analysis are used to accept or reject
prefix-stem-suffix triples. Finally, all unique accepted triples are concatenated and output. In the case
that no surface form is found, a back-off solution that attempts to regenerate after discarding one
of the input features is explored.
See [97] for more details on Almorgeana and an evaluation of its performance. Al-
morgeana is the analyzer/generator used inside the MADA toolkit, which we discuss in detail
in Section 5.5.
Figure 5.2: Almorgeana features and their possible values (version 2.0). Clitic features (such as
conjunction and preposition) are optional; however, all other features are obligatory (although in some
cases POS dependent, e.g., nouns do not take aspect or voice). State is handled using two features:
definiteness and possession.
as concrete morphemes. The concrete templatic morphemes are interdigitated and affixes added.
Separate morphophonemic and orthographic rewrite rules are applied. Magead's implementation
of rewrite rules follows [100] in using a multi-tape finite-state transducer (FST) representation.
This is similar to other FST-based implementations for Arabic morphology [86]. The use of explicit
linguistic rules inside Magead distinguishes it from other more opaque implementations such as
Bama and Almorgeana, in which the rules are effectively hard-coded in the form of the stem.
This transparency makes Magead a more complex system in certain ways, but it also makes it easier
to extend to new dialects. The distinction between different levels of representation also allows using
Magead for a variety of tasks such as mapping from an orthographic form to a phonological form.
In the rest of this section, we discuss Magead's components in more detail using an illustrative
example.
Lexeme and Features Magead's morphological analyses are represented in terms of a lexeme and
features. Magead defines the lexeme to be a triple consisting of a root, a morphological behavior class
(MBC), and a meaning index. It is through this view of the lexeme that Magead can both have
a lexeme-based representation, and operate without a lexicon (as may be needed for dealing with a
dialect). In fact, because lexemes have internal structure, Magead can hypothesize lexemes on the
fly without having to make wild guesses. For example, the word ازدهرت Aizdaharat 'she/it flourished'
has the following lexeme-and-features analysis in Magead:
Morphological Behavior Class An MBC maps sets of linguistic feature-value pairs to sets of ab-
stract morphemes. For example, MBC verb-VIII maps the feature-value pair ASPECT:PERF
to the abstract root morpheme [PAT_PV:VIII], which in MSA corresponds to the concrete root
morpheme V1tV2V3, while the MBC verb-II maps ASPECT:PERF to the abstract root mor-
pheme [PAT_PV:II], which in MSA corresponds to the concrete root morpheme 1V22V3. MBCs
are defined using a hierarchical representation with non-monotonic inheritance. The hierarchy al-
lows Magead to specify only once those feature-to-morpheme mappings for all MBCs that share
them. For example, the root node of the MBC hierarchy is a word, and all Arabic words share certain
mappings, such as that from the linguistic feature conj:w to the clitic w+. This means that all Arabic
words can take a cliticized conjunction. Similarly, the object pronominal clitics are the same for all
transitive verbs, no matter what their templatic pattern is. The design of Magead assumes that the
MBC hierarchy is variant-independent, i.e., dialect/MSA independent, although some modifications
may be needed as more variants are added.
Note that as the root, pattern, and vocalism are not ordered with respect to each other, they
are simply juxtaposed. The + sign indicates the ordering of affixal morphemes. Only now are the
AMs translated to concrete morphemes (CMs), which are concatenated in the specified order. Our
example becomes:
Simple interdigitation of root, pattern and vocalism then yields the form iztahar+at. This form is
incorrect since no morphological rules have been applied yet.
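The interdigitation step, followed by one morphophonemic rewrite, can be sketched in plain code. This is not how Magead is implemented (it uses multi-tape finite-state transducers); it is a simplified illustration in which the function names are ours, the vocalism "iaa" is assumed from the resulting form iztahar, and the single rewrite rule is the Form VIII voicing rule from Section 4.2.4.

```python
import re


def interdigitate(root, pattern, vocalism):
    """Fill a templatic pattern with root radicals and vocalism vowels.

    Digits in `pattern` index the radicals of `root` (1-based), each "V"
    slot takes the next vowel of `vocalism`, other characters are copied.
    """
    vowels = iter(vocalism)
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(root[int(ch) - 1])
        elif ch == "V":
            out.append(next(vowels))
        else:
            out.append(ch)
    return "".join(out)


def form_viii_voicing(stem):
    """Rewrite the pattern t as d after a first radical z, d or ð."""
    return re.sub(r"^(i[zdð])t", r"\1d", stem)


stem = interdigitate("zhr", "V1tV2V3", "iaa")
print(stem)                             # iztahar
print(form_viii_voicing(stem) + "at")   # izdaharat
print(interdigitate("ktb", "1V22V3", "aa"))  # kattab
```

Keeping the rewrite rule separate from the interdigitation mirrors the separation between morphemic and phonological levels described in this section.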
Rules Magead has two types of rules. Morphophonemic/phonological rules map from the morphemic
representation to the phonological and orthographic representations. Orthographic rules rewrite only
the orthographic representation. These include, for example, rules for using the shadda (consonant
doubling diacritic). For our example, we get /izdaharat/ at the phonological level (see Section 4.2.4).
Using standard MSA diacritized orthography, our example becomes Aizdaharat (in transliteration).
Removing the diacritics turns this into the more familiar ازدهرت Azdhrt. Note that in analysis mode,
Magead hypothesizes all possible diacritics (a finite number, even in combination) and performs
the analysis on the resulting multi-path automaton.
For a fuller discussion of Magead, see [92, 80, 101].
(5.5) ["prosper","flourish"]
Verb [] [] [] [VIII]
izdahar "z h r" IFtaCaL
VP-A-3FS-- izdaharat "z h r" IFtaCaL |<< "at"
The first line above is the English gloss. The second line summarizes the information of
the lexical entry for the lexeme. In this case, the three bracket pairs after "Verb" would list any
lexically dependent or exceptional perfective, imperfective, and imperative verb stems, but they are
empty in our example because this information is inferred internally by the ElixirFM system.
The final [VIII] indicates explicitly for the user that the pattern IFtaCaL belongs to the Form VIII
derivational class. The third line indicates the lemma, root and lemma pattern (which happens to
be the same as the pattern of the analyzed word in this example). The last line indicates the POS
(See Section 5.4.5), the phonological form of the word, the root, the pattern and the suffix. Note
that the pattern associated with the verb has the unassimilated t in it.
Phonology and Orthography Another unique feature of ElixirFM is that it internally represents
its lexical items in a phonemic representation, which is then converted into a string of characters in
the extended ArabTeX [16] notation. This notation can then be further converted into either Arabic
orthography or phonetic transcription. This allows ElixirFM to avoid defining orthographic rules,
and it basically separates phonology from orthography in a way similar to Magead (Section 5.2.4).
["book"]
Noun [FuCuL] []
kitAb "k t b" FiCAL
li-al-kutubi
The core of ElixirFM is written in the functional programming language Haskell, while
interfaces supporting lexicon editing and other interactions are written in Perl. See Appendix D for
links to ElixirFM and its online interface.
5.3 TOKENIZATION
The common wisdom in NLP is that tokenization of Arabic words through decliticization and
reductive orthographic normalization is helpful for many applications such as language modeling
(LM), IR and statistical MT (SMT). Tokenization and normalization reduce sparsity and perplexity
and decrease the number of out-of-vocabulary (OOV) words.
5.3.2 DETOKENIZATION
In certain contexts, when Arabic is the output language, it is desirable to produce proper Arabic
that is orthographically correct; i.e., tokenized and orthographically normalized words should be
detokenized and enriched (orthographically corrected). As an example, the output of English-to-
Arabic MT systems is reasonably expected to be proper Arabic regardless of the preprocessing used
to optimize the MT performance. Anything less is comparable to producing all lower cased English
or uncliticized and undiacritized French. Detokenization may not be a simple task because there are
several morphological adjustments that should be applied in the process [99, 13, 106]. Obviously,
the more complex the tokenization, the harder the detokenization.
Figure 5.3: Example with different tokenization schemes: ST/D0 (simple tokenization), ONEnr (enriched
orthographic normalization), ONRed (reduced orthographic normalization), D1, D2, D3/S1 and
S2 (different degrees of decliticization), WA (wa+ decliticization), TB and TBold (new and old Arabic
Treebank tokenization, respectively), MR (morphemes), LEM (lemmatization), LEM+TB (lemmatization
with TB) and ENX (a tokenization equivalent to D3+LEM+POS with markers for verbal subject).

Arabic         وسينهى الرئيس جولته بزيارة الى تركيا .
Input (ST/D0)  wsynhý Alrŷys jwlth bzyArħ Alý trkyA .
Gloss          and will finish the president tour his with visit to Turkey .
English        The president will finish his tour with a visit to Turkey.

Scheme   Tokenization
ONEnr    wsynhy Alrŷys jwlth bzyArħ Ǎlý trkyA .
ONRed    wsynhy Alrŷys jwlth bzyArħ Aly trkyA .
D1       w+ synhy Alrŷys jwlth bzyArħ Ǎlý trkyA .
D2       w+ s+ ynhy Alrŷys jwlth b+ zyArħ Ǎlý trkyA .
D3/S1    w+ s+ ynhy Al+ rŷys jwlħ +h b+ zyArħ Ǎlý trkyA .
S2       w+s+ ynhy Al+ rŷys jwlħ +h b+ zyArħ Ǎlý trkyA .
WA       w+ synhy Alrŷys jwlth bzyArħ Ǎlý trkyA .
TB       w+ s+ ynhy Alrŷys jwlħ +h b+ zyArħ Ǎlý trkyA .
TBold    w+ synhy Alrŷys jwlħ +h b+ zyArħ Ǎlý trkyA .
MR       w+ s+ y+ nhy Al+ rŷys jwl +ħ +h b+ zyAr +ħ Ǎlý trkyA .
LEM      Ânhý rŷys jwlħ zyArħ Ǎlý trkyA .
LEM+TB   w+ s+ Ânhý rŷys jwlħ +h b+ zyArħ Ǎlý trkyA .
ENX      w+ s+ Ânhý_VBP +S_3MS Al+ rŷys_NN jwlħ_NN +h b+ zyArħ_NN Ǎlý_IN trkyA_NNP .
D1, D2, and D3: Decliticization (degree 1, 2 and 3) are schemes that split off clitics. D1
splits off the class of conjunction clitics (w+ and f+) and the infrequent interrogative clitic. D2 is the
same as D1 plus splitting off the class of particles (l+, k+, b+ and s+). Finally, D3 splits off what D2
does in addition to the definite article Al+ and all pronominal enclitics.
WA: Decliticizing the conjunction w+. It is similar to D1, but without including f+. This
simple tokenization is reported to be optimal for SMT with very large data sets [98].
5.4. POS TAGGING 79
TB: Penn Arabic Treebank Tokenization. This is the same tokenization scheme used in the
Arabic Treebank [9]. This is similar to D3 but without the splitting off of the definite article Al+.
An older version of TB did not split the future particle s+.
S1 and S2 are schemes used by [99]. S1 and S2 are essentially the same as D3. S2 joins the
various proclitics in one string.
MR: Morphemes. This scheme breaks up words into stem and affixal morphemes. It is
identical to the initial tokenization used by [108].
LEM: Lemmas. This scheme reduces every word to its lemma. Lemmas can also be used
with other tokenization schemes where they are used for each split token; see LEM+TB in Figure 5.3.
ENX: English-like tokenization used by [105]. This scheme is intended to minimize differences
between Arabic and English. It decliticizes similarly to D3 but uses lemmas and POS tags
instead of the regenerated words. The POS tag set used is the Bies reduced Arabic Treebank tag
set (Section 5.4.2) [9, 109]. Additionally, the subject inflection is indicated explicitly as a separate
token. Many other variations are possible here.
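The decliticization schemes described above can be sketched as a small function. This is only an illustration under stated assumptions: each word is assumed to arrive already analyzed into classed segments (hard-coded here), and the names `SCHEMES`, `WORDS` and `tokenize` are hypothetical rather than part of any existing tool. Each segment carries both its attached form and its standalone form, so that a stem such as jwlt can surface as jwlh once its enclitic is split off.

```python
# Hypothetical clitic class labels used only in this sketch.
CONJ, PART, DET, STEM, PRON = "conj", "part", "det", "stem", "pron"

# Clitic classes split off by each scheme, following the descriptions above.
SCHEMES = {
    "D1": {CONJ},
    "D2": {CONJ, PART},
    "D3": {CONJ, PART, DET, PRON},
    "TB": {CONJ, PART, PRON},  # like D3, but the definite article stays attached
}

# Assumed, hand-written analyses: (attached_form, standalone_form, class).
WORDS = {
    "wsynhy": [("w", "w", CONJ), ("s", "s", PART), ("ynhy", "ynhy", STEM)],
    "Alryys": [("Al", "Al", DET), ("ryys", "ryys", STEM)],
    "jwlth": [("jwlt", "jwlh", STEM), ("h", "h", PRON)],
    "bzyArh": [("b", "b", PART), ("zyArh", "zyArh", STEM)],
}

def tokenize(segments, scheme):
    """Render one analyzed word under the given scheme."""
    split = SCHEMES[scheme]
    proclitics, enclitics = [], []
    attached, standalone = "", ""
    for att, alone, cls in segments:
        if cls == PRON and PRON in split:
            enclitics.append("+" + alone)      # split pronominal enclitic
        elif cls in split and not attached:
            proclitics.append(alone + "+")     # split proclitic
        else:
            attached += att                    # stays inside the core token
            standalone += alone
    # Use the standalone spelling only when an enclitic was actually removed.
    core = standalone if enclitics else attached
    return " ".join(proclitics + [core] + enclitics)

for scheme in ("D1", "D2", "D3", "TB"):
    print(scheme, " ".join(tokenize(WORDS[w], scheme) for w in WORDS))
```

The WA scheme is left out because it depends on the specific clitic (w+ but not f+) rather than on a clitic class.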
Nominals:
Nouns: NN singular common noun or abbreviation, NNS plural/dual common noun, NNP
singular proper noun, NNPS plural/dual proper noun
Pronouns: PRP personal pronoun, PRP$ possessive personal pronoun, WP relative pronoun
Other: JJ adjective, RB adverb, WRB relative adverb, CD cardinal number, FW foreign
word
Verbs: VBP active imperfect verb, VBN passive imperfect/perfect verb, VBD active perfect verb,
VB imperative verb
The following punctuation marks are given a tag that corresponds to their exact form: [,], [:],
[.], [], [-LRB-] (footnote 9) and [-RRB-].
The following nouns and adjectives are marked explicitly: quantifier nouns
(NOUN_QUANT), comparative adjectives (ADJ_COMP), adjectival/ordinal numbers
(ADJ_NUM) and deverbals (DV).
Demonstratives and the definite article are distinguished as DEM and DT, respectively.
The presence of a definite article (DT) is indicated in the tag, e.g., DT+NN, DT+ADJ_COMP,
DT+CD, and DT+JJ.
8 Although sometimes comma, period and colon can be POS tagged as themselves, e.g., the tag for the comma is [,].
9 -LRB- is left round bracket; and -RRB- is right round bracket.
The Extended Reduced Tag Set (ERTS)
The Extended Reduced Tag Set (ERTS) is the base tag set used in the Amira system (see
Section 5.5.2). ERTS has 72 tags. It is a subset of the full Buckwalter morphological set defined over
tokenized text. ERTS is a superset of the Bies/RTS tag set. In addition to the information contained
in the Bies tags, ERTS encodes additional morphological features such as number, gender, and
definiteness on nominals only. Definiteness (or precisely here the presence of the definite article) is
marked as a binary feature with D (for present article) or (nothing) for no article. Gender is marked
with an F, an M or nothing, corresponding to feminine, masculine or the absence of gender marking,
respectively. Number is marked with (Du) for dual or (S) for plural. The absence of any label
indicates singular. For example, while Bies nouns are tagged as either NN or NNS, indicating only
number, ERTS noun tags represent definiteness and gender in addition to number, e.g., DNNM
is a definite (i.e., with article) singular masculine noun. A full description of ERTS is presented in
[111]. The ERTS set was shown to be taggable with the same accuracy as the Bies tag set while adding
much more value as learning features to a higher-order computational task, Base Phrase Chunking
[111].
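As a rough illustration of how such tags layer features on top of a Bies base tag, the sketch below decodes the definiteness prefix and the gender suffix for tags of the shape seen in this chapter (DNNM, NNF, DJJF, NNPM). It is an assumption-laden sketch, not the Amira implementation: the set `NOMINAL_BASES` and the function `decode_erts` are hypothetical, and number marking is deliberately ignored because the text does not specify where the number marker sits relative to the gender marker.

```python
# Assumed set of Bies base tags that can carry ERTS nominal features.
NOMINAL_BASES = {"NN", "NNS", "NNP", "NNPS", "JJ"}

def decode_erts(tag):
    """Split an ERTS-style tag into base tag, definiteness and gender.

    Tags that do not match a nominal base (e.g. CD, DT, PRP$) are
    returned unchanged with no features.
    """
    for definite in (False, True):
        if definite and not tag.startswith("D"):
            continue
        body = tag[1:] if definite else tag
        for gender in (None, "M", "F"):
            if gender and not body.endswith(gender):
                continue
            base = body[:-1] if gender else body
            if base in NOMINAL_BASES:
                return {"base": base, "definite": definite, "gender": gender}
    return {"base": tag, "definite": False, "gender": None}

for t in ("DNNM", "NNF", "DJJF", "NNPM", "DT", "CD"):
    print(t, decode_erts(t))
```

Matching against a closed set of base tags keeps DT (the definite article tag) from being misread as a definite "T".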
The Feature part of the tag consists of a seven-character string. Each character
encodes the value of the feature assigned to its position:
5.5.1 MADA+TOKAN
Mada (Morphological Analysis and Disambiguation for Arabic) is a utility that, given raw Arabic
text, adds as much lexical and morphological information as possible by disambiguating, in one
operation, part-of-speech tags, lexemes, diacritizations and full morphological analyses [51, 35,
116]. Mada's approach distinguishes between the problems of morphological analysis, which is
handled by a morphological analyzer (Almorgeana), and morphological disambiguation. Mada
is a morphological disambiguation system. Once a morphological analysis is chosen in context, so
are its full POS tag, lemma and diacritization (all in a single step). Knowing the morphological
analysis also allows for deterministic tokenization and stemming, which are handled by Tokan once
Mada has finished processing the text.
MADA Mada operates in stages. First, it uses Almorgeana internally to produce a list of po-
tential analyses for each word encountered in the text; at this point, word context is not considered.
Mada then makes use of up to 19 features to rank the list of analyses. For each feature, a classifier is
used to create a prediction for the value of that feature for each word in its context. Fourteen of the
features use Support Vector Machine (SVM) classifiers; the remaining features capture information
such as spelling variations and n-gram statistics. Each classifier prediction is weighted using a tun-
ing set, and the collection of feature predictions is compared to the list of potential morphological
analyses. Those analyses that more closely agree with the weighted set of feature predictions receive
higher ranking scores than those which do not; the highest scoring analysis is flagged as the correct
analysis for that word in that context. Since Mada selects a complete analysis from Almorgeana,
all decisions regarding morphological ambiguity, lexical ambiguity, tokenization, diacritization and
POS tagging in any possible POS tag set are made in one fell swoop. Mada has over 96% accuracy
on lemmatization and on basic morphological choice (including tokenization but excluding syntactic
5.5. TWO TOOL SUITES
Figure 5.6: A comparison of several POS tag sets for the sentence
xmswn Alf sAyH zArwA mdyntnA Aljmylh fy Aylwl AlmADy '50 thousand tourists visited our beautiful city last September.'
xams+uwna 'fifty'
  Buckwalter/PATB: NOUN_NUM+NSUFF_MASC_PL_NOM
  CATiB: NOM  Bies: CD  Kulick: CD  ERTS: CD  Khoja: NNuCaPlM  PADT: QL------1I
  Almorgeana/Mada: POS:NUM +MASC +PL +NOM
alf+a 'thousand'
  Buckwalter/PATB: NOUN_NUM+CASE_DEF_ACC
  CATiB: NOM  Bies: CD  Kulick: CD  ERTS: CD  Khoja: NNuCaSgM  PADT: QM-----S4R
  Almorgeana/Mada: POS:NUM +DEF +ACC
sAyiH+ 'tourist'
  Buckwalter/PATB: NOUN+CASE_INDEF_GEN
  CATiB: NOM  Bies: NN  Kulick: NN  ERTS: NNM  Khoja: NCSgMGI  PADT: N------S2I
  Almorgeana/Mada: POS:N +INDEF +GEN
zAr+uwA 'visited'
  Buckwalter/PATB: PV+PVSUFF_SUBJ:3MP
  CATiB: VRB  Bies: VBD  Kulick: VBD  ERTS: VBD  Khoja: VPPl3M  PADT: VP-A-3MP--
  Almorgeana/Mada: POS:V +PV +S:3MS
madiyn+ah+a 'city'
  Buckwalter/PATB: NOUN+NSUFF_FEM_SG+CASE_DEF_ACC
  CATiB: NOM  Bies: NN  Kulick: NN  ERTS: NNF  Khoja: NCSgFAI  PADT: N------S4R
  Almorgeana/Mada: POS:N +FEM +SG +DEF +ACC
+nA 'our'
  Buckwalter/PATB: POSS_PRON_1P
  CATiB: NOM  Bies: PRP$  Kulick: PRP$  ERTS: PRP$  Khoja: NPrPPl1  PADT: S----1-P2-
  Almorgeana/Mada: +P:1P
Al+jamiyl+ah+a 'beautiful'
  Buckwalter/PATB: DET+ADJ+NSUFF_FEM_SG+CASE_DEF_ACC
  CATiB: NOM  Bies: JJ  Kulick: DT+JJ  ERTS: DJJF  Khoja: NASgFAD  PADT: A-----FS4D
  Almorgeana/Mada: POS:AJ Al+ +FEM +SG +DEF +ACC
fiy 'in'
  Buckwalter/PATB: PREP
  CATiB: PRT  Bies: IN  Kulick: IN  ERTS: IN  Khoja: PPr  PADT: P---------
  Almorgeana/Mada: POS:P
ay.luwl+a 'September'
  Buckwalter/PATB: NOUN_PROP+CASE_INDEF_GEN
  CATiB: PROP  Bies: NNP  Kulick: NNP  ERTS: NNPM  Khoja: Rmy  PADT: N------S2I
  Almorgeana/Mada: POS:PN +INDEF +GEN
Al+mADiy 'past'
  Buckwalter/PATB: DET+ADJ
  CATiB: NOM  Bies: JJ  Kulick: DT+JJ  ERTS: DJJM  Khoja: NASgMGD  PADT: A-----MS2D
  Almorgeana/Mada: POS:AJ Al+
. '.'
  Buckwalter/PATB: PUNC
  CATiB: PNX  Bies: PUNC  Kulick: .  ERTS: PUNC  Khoja: PU  PADT: G---------
  Almorgeana/Mada: POS:PX
88 5. COMPUTATIONAL MORPHOLOGY TASKS
case, mood, and state). Mada has over 86% accuracy in predicting full diacritization (including syn-
tactic case and mood). Detailed comparative evaluations are provided in the following publications:
[51, 35, 116] .
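The ranking step described above, in which candidate analyses are scored by their weighted agreement with per-feature classifier predictions, can be sketched as follows. This is a minimal sketch under stated assumptions, not Mada's actual code: the feature names, weights, predictions and candidate analyses are invented for illustration, and the function name `rank_analyses` is hypothetical.

```python
def rank_analyses(analyses, predictions, weights):
    """Order candidate analyses by weighted agreement with predictions.

    analyses:    list of dicts mapping feature name -> value
    predictions: dict mapping feature name -> predicted value in context
    weights:     dict mapping feature name -> tuned weight
    """
    def score(analysis):
        return sum(
            weights.get(feature, 0.0)
            for feature, value in predictions.items()
            if analysis.get(feature) == value
        )
    return sorted(analyses, key=score, reverse=True)

# Toy data (assumed): two out-of-context analyses of one word.
candidates = [
    {"lemma": "A", "pos": "N", "gen": "m"},
    {"lemma": "B", "pos": "V", "gen": "m"},
]
predicted = {"pos": "V", "gen": "m"}   # classifier outputs for this context
tuned = {"pos": 2.0, "gen": 1.0}       # weights from a tuning set

best = rank_analyses(candidates, predicted, tuned)[0]
print(best)
```

Because the winner is a complete analysis, its lemma, POS tag and every other attribute are selected together, which is the "one fell swoop" property the text describes.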
The operation of Mada is versatile and highly configurable. Starting with version 2.0, Mada
applies weights to each of the 19 features it uses for better accuracy; these weights are determined on
a tuning set and are optimized for different purposes, such as tokenization, diacritization, or POS
tagging. These weight sets are included with the package and should be chosen by the user depending
on how Mada will be used. However, users can also choose to set these weights directly themselves.
By default, Mada attempts to rank complete analyses in terms of overall correctness. By choosing
an alternative feature and weight set, it is possible to have Mada focus more specifically on getting
a particular analysis aspect correct. For example, users can achieve a 0.4% absolute improvement in
POS tagging accuracy if they use the weight set that was tuned for POS tagging, as opposed to the
default set. However, the accuracy of the other Mada outputs (the lexeme prediction, for example)
may suffer. Mada also includes a morphological back-off procedure, which can be turned on or off
by the user.
TOKAN Tokan is a general tokenizer for Arabic that provides an easy-to-use resource for tok-
enizing Mada-disambiguated Arabic text into a large set of possibilities [83, 97]. The decision on
whether an Arabic word has a conjunction or preposition clitic is made in Mada, but the actual
tokenization of the clitics including handling various morphotactics and spelling regularization is
done in Tokan. The tokenization scheme can be used as a parameter in machine learning for a variety
of applications, such as machine translation or named-entity recognition.
Tokan takes as input a Mada-disambiguated file and a tokenization scheme description that
specifies the tokenization target. Consider the following specification:
This scheme separates conjunctions, prepositions, verbal particles, the definite article and
pronominal clitics and it adds the basic POS tag to the form of the word. The scheme also specifies
that diacritics are generated. An analysis of the word wasayukAtibuhA 'and he will
correspond with her' would be tokenized as wa+ sa+ yukAtibu/V +hA. A simpler scheme such
as w+ f+ REST would simply produce w+ sykAtbhA. See [83, 105] for a detailed description
of several schemes that have become commonly followed since that work was published. Tokan
has a large number of other features that allow the user to perform different kinds of orthographic
normalizations or control how the output is ordered and presented as it may fit different needs of
different systems. All of the tokenization schemes shown in Figure 5.3 are supported by Tokan.
Internally, Tokan uses morphological generation (through Almorgeana) to recreate the
word once different clitics are split off. This approach of back generation allows us to modify the
morphological content in a word including, for instance, deleting/defaulting specific features of a
word easily. This ensures that the form of the generated word is normalized and consistent with
other occurrences of that word. For example, simply splitting the pronominal clitic off a word with
Ta-Marbuta (h) would keep the Ta-Marbuta in its word-internal form (regular letter Ta, t).
With Tokan, the Ta-Marbuta is generated as appropriate. For example, jwlth 'his visit' is
tokenized into jwlh +h 'visit +his', not jwlt +h (which is not a valid spelling).
MADA+TOKAN for NLP Applications Mada+Tokan has been used by numerous academic and
commercial research institutes around the world. Here are some examples of its use. In the context
of machine translation (MT) from Arabic to English, [83] and [105] explored the use of different
preprocessing schemes and their combination. Their results have been followed by different groups
of researchers working on Arabic-English MT [124, 125, 126]. [127] explored the use of Mada-
generated diacritizations for MT. [107] improved automatic word alignment for Arabic-English
MT using combinations of different tokenization schemes generated by Mada+Tokan. See [97]
for more details on different representations of Arabic morphology for MT. [99] used Mada in the
context of English-to-Arabic MT. Mada has also been used to produce features for Named Entity
Recognition (NER) [128, 129].
5.5.2 AMIRA
Amira is a set of tools built as a successor to the Asvmt toolkit developed at Stanford University
[109] and described in detail in [117]. The toolkit includes a tokenizer, a part of speech tagger (POS)
and a base phrase chunker (BPC), also known as a shallow syntactic parser. We focus in this section
on Amira-Tok and Amira-Pos. The technology of Amira is based on supervised learning with
no explicit dependence on knowledge of deep morphology; hence, in contrast to Mada, it relies on
surface data to learn generalizations. In general, the tools use a unified framework which casts each
of the component problems as a classification problem. The underlying technology uses Support
Vector Machines in a sequence modeling framework.
AMIRA-TOK Amira-Tok focuses primarily on clitic tokenization. Amira tools do not rely on
morphological analysis or generation tools in any of their processes. Hence, Amira-Tok learns clitic
tokenization generalizations directly from the clitic segmentations present in the Penn Arabic Treebank
(PATB), without relying on explicit rules.
Amira-Tok segments off the following set of clitics: conjunction proclitics w+, f+,
prepositional proclitics k+, l+, b+, future marker proclitic s+, verbal particle proclitic
l+, definite article proclitic Al+, and pronominal enclitics indicating possessive/object pronouns.
The particular insight of the Amira-Tok solution is to treat tokenization of Arabic words
as a character-level chunking problem. This allows using IOB syntactic chunking solutions usually
used at the phrase level on the sub-word level. Here, every character (including punctuation) is
annotated as: inside a chunk (I), outside a chunk (O), or beginning of a chunk (B), hence the name
IOB. For the I and B tags, there are five possible classes: Prefix 1 (e.g., conjunction proclitic), Prefix
2 (e.g., preposition), Prefix 3 (e.g., definite article), Word, Suffix (e.g., pronominal enclitic). This
leads to a total of 11 classes in the data: O, B-PRE1, I-PRE1, B-PRE2, I-PRE2, B-PRE3, I-PRE3,
B-WORD, I-WORD, B-SUFF, I-SUFF. By learning how to assign these class labels, Amira-Tok
learns how to segment the words.
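The character-level IOB encoding can be made concrete with a short sketch that converts a segmented word into per-character labels and back. The functions `to_iob` and `from_iob` are hypothetical helpers, and the example word wbAlbyt (w+ b+ Al+ byt, 'and in the house') is chosen for illustration; the sketch shows only the label scheme, not the SVM sequence model that predicts the labels.

```python
def to_iob(segments):
    """Turn [(text, cls), ...] into [(char, label), ...].

    cls is one of PRE1, PRE2, PRE3, WORD, SUFF, or O for characters
    outside any chunk.
    """
    labeled = []
    for text, cls in segments:
        for i, ch in enumerate(text):
            if cls == "O":
                labeled.append((ch, "O"))
            else:
                labeled.append((ch, ("B-" if i == 0 else "I-") + cls))
    return labeled

def from_iob(labeled):
    """Rebuild [(text, cls), ...] from per-character labels."""
    segments = []
    for ch, label in labeled:
        if label == "O":
            segments.append((ch, "O"))
            continue
        position, cls = label.split("-", 1)
        if position == "B" or not segments or segments[-1][1] != cls:
            segments.append((ch, cls))          # start a new chunk
        else:
            text, _ = segments[-1]
            segments[-1] = (text + ch, cls)     # extend the current chunk
    return segments

word = [("w", "PRE1"), ("b", "PRE2"), ("Al", "PRE3"), ("byt", "WORD")]
labels = to_iob(word)
print(labels)
print(from_iob(labels))
```

Once a classifier has assigned a label to every character, segmentation reduces to the deterministic decoding done by `from_iob`.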
Amira-Tok does not produce stemmed words that are not valid Arabic words. The tool applies
some heuristics to reverse the effect of morphotactics such as the loss of A in the definite article
Al+ when in the context of the proclitic preposition l+ 'for'. Most such morphotactic restorations
are applied deterministically. However, non-deterministic morphotactics such as those involving
the nominal feminine marker (Ta-Marbuta) and the Alif-Maqsura are determined automatically
through another layer of learning that classifies word-final letters. For Ta-Marbuta, a stem-final
t either remains a regular t or is converted to h. For Alif-Maqsura, a stem-final A either remains
an A or is converted to an Alif-Maqsura.
Although the primary Amira tokenization is to split off clitics and normalize the stem, the
tool interface allows a limited number of variants, which include the level of clitic segmentation, and
whether tokenization is indicated with spaces (changing the token count) or with a plus sign only
(preserving token count). For example, the word wllblAd 'and for the countries' can have
the following tokenizations among many others: w+ l+ Al+ blAd (Amira-Tok internal), w+ llblAd
(Conjunction-only), w+l+ AlblAd (Preposition-only), wl+Al+ blAd (Al-only), wllblAd (Suffix only),
and wll+ blAd (All Prefixes+Suffix).
Amira-Tok performs at a high F-score measure of 99.2% [117].
AMIRA-POS Amira primarily uses the ERTS POS tag set and assumes the text is clitic tok-
enized. POS tagging in Amira-Pos is done through an SVM-based classification approach using
character n-grams as features in the sequence models.
The user has the flexibility to input raw or tokenized text in a scheme that is consistent with
one of the schemes defined by Amira-Tok. Consequently, the user may request that the POS tags
be assigned to the surface forms. Internally, in case of the raw input, Amira-Pos runs Amira-Tok
on the raw text and then performs POS tagging. The output can be presented as tokenized and POS
tagged, or without tokenization where the POS tag is assigned to the surface words. In this latter
case, the ERTS tag set is appended with the clitic POS tags to form more complex POS tags. The
user can choose to either tag with ERTS or RTS (Section 5.4.2).
Interestingly, the accuracy of the ERTS tagger is 96.13% and the accuracy of the RTS tagger
is 96.15%. This suggests that the choice of information to include in the ERTS tag set reflects a natural
division in the syntactic space. The richer tag set (ERTS) has been shown to improve the quality of
downstream processing such as base phrase chunking [111, 120].
AMIRA for NLP Applications Amira has been successfully used by several groups in the context
of text MT, specifically for alignment improvement and reordering within the context of statistical
MT [130], and also for identifying difficult source language text [131]. Moreover, the Amira suite
was used in the context of speech MT [132]. The Amira suite was explored for the purposes of
cross language information retrieval in work by [104]. Amira has been used to produce POS tag
and BPC features for Arabic named entity recognition (NER) [129, 133].
5.5.3 COMPARING MADA+TOKAN WITH AMIRA
In this section, we compare and contrast Mada+Tokan and Amira in terms of their design,
functionality and performance.
Design As for their design, it may help to contextualize the different tools in terms of their
basic use in two suites: the Mada suite and the Amira suite. Within the Mada suite, there is an
explicit morphological analysis step handled by Almorgeana. The second, in fact core, component
in the Mada suite, is the Mada system, which disambiguates the analyses produced by the morpho-
logical analyzers. Finally, the Tokan component makes use of the morphological generation power
of Almorgeana to tokenize the disambiguated analysis through regeneration. In the Amira suite,
the two components focus on tokenization (Amira-Tok) and POS tagging (Amira-Pos).
In terms of their design, Amira-Tok and Amira-Pos are different from the Mada suite
in that they take a two-step approach to POS tagging: tokenize then tag. In comparison, Mada
has a different approach that breaks the problem into three steps (analyze, disambiguate, generate),
which are orthogonal to Amira's split. Although there are three steps in Mada, the decisions for
tokenization and POS tagging are made together in one fell swoop. One way of distinguishing these
tools is in terms of the depth of linguistic knowledge needed. Amira is shallow in that it focuses on
form-based morphology (specifically cliticization) learned from annotated data; whereas Mada has
access to deeper lexically modeled functional morphology. Another difference between the current
Mada suite and the Amira suite is that the former may produce no analysis for a given word if it
does not exist in the underlying morphological tools (although typically analysis back-off is used in
such cases) while the Amira suite always produces a hypothesized tokenization and POS tag for
every word in the text.
In terms of their training needs, the Mada suite expects the presence of both a morphological
analyzer and training data for supervised learning, whereas the Amira suite only needs annotated
training data. The training data could be created through any number of ways, including the use of
morphological analyzers followed by human annotation, but this is not a requirement for the Amira
suite. These different yet similar requirements put similar limits on the kind of extensions that could
be done in either approach. For example, going to an Arabic dialect would require the presence of
some morphological analyzer/generator for the dialect for Mada, but not Amira. However, both
need some amount of annotated data to train on.
Functionality In terms of functionality, we consider five applications: tokenization, diacritiza-
tion, POS tagging, lemmatization and base-phrase chunking. Base-phrase chunking is only handled
in the Amira suite, but it is in fact a separate module that can be used independently with the
Mada suite. The other four applications are handled at once in Mada as part of its common
morphological disambiguation process. Amira does not handle lemmatization or diacritization. As
for tokenization and POS tagging, since Mada goes deeper than Amira, a wider set of possible
tokenization schemes and POS tags can be output by Mada. Although Amira is more limited
by comparison, it does handle the most commonly used tokenizations and POS tags. Researchers
interested in exploring a large number of different sets of tokenizations as features in their systems
should consider Mada. Researchers only interested in limited comparisons or specific applications,
whose tokenizations and POS tags are supported by Amira, should consider Amira.
Performance It is hard to compare the performance of Amira and Mada suites. Previous
attempts by [51] show that similar performance is possible on tasks that are shared: specific PATB
tokenization and POS tags. Amira can be significantly faster than Mada; however, Mada needs
to be run only once and a much larger number of tokenizations and POS tags (in addition to other
outputs not supported by Amira) can be produced by running the fast Tokan step.
CHAPTER 6
Arabic Syntax
Syntax is the linguistic discipline interested in modeling how words are arranged together to make
larger sequences in a language. Whereas morphology describes the structure of words internally,
syntax describes how words come together to make phrases and sentences.
Much of the vocabulary discussing syntax is shared across different languages, e.g., verb, verb
phrase, subject and object. There are some exceptions that pertain to unique structures that are not
found cross-linguistically, e.g., Idafa and Tamyiz in Arabic. We discuss specific and general terms of
syntax as needed in this chapter. For a general introduction to syntax, we urge the reader to consider
the numerous publications available, e.g., [134, 135] among others.
This chapter is organized as follows. We first present a sketch of Arabic syntax in Section 6.1.
In Section 6.2, we present three Arabic treebanking projects and compare the different approaches
they use. Finally, Section 6.3 summarizes research efforts in syntactic parsing of Arabic.
Non-pronominal subjects appear after the verb. The verb agrees with the subject in person
(3rd) and gender (Masc or Fem) but not number, which defaults to Sg. A verb with a
non-pronominal subject in a V-Sent is never Pl. The subject receives the Nom case.
katab+a Al+walad+u/Al+wlAd+u
wrote+3MS the+boy+Nom/the+boys+Nom
'The boy/boys wrote'
As we saw earlier in Section 4.2.1, pronominal objects appear as part of verbal suffixes re-
gardless of whether the subject is pronominal or not. Here are the above two constructions with
pronominal objects:
masculine/feminine affixes which are used in the number-blind agreement in V-Sent. A concept of a 3rd person singular
masculine/feminine hidden pronoun is introduced to explain 3rd person singular masculine/feminine pro-drop.
6.1. A SKETCH OF ARABIC SYNTACTIC STRUCTURES 95
katab+nA -hA
wrote+1P -it
'We wrote it'
Non-pronominal verb objects typically follow the subject. The object receives the Acc case.
Given that case endings are not always written, there is a common ambiguity associated with
the sequence [Verb+3S NounPhrase] when the Verb and NounPhrase agree in gender: (a.) the
NounPhrase is the subject or (b.) the subject is pronominal (3MS or 3FS) and the NounPhrase is
the object.
As in other languages, Arabic has intransitive, transitive and ditransitive verbs that take zero,
one or two objects, respectively. The direct and indirect objects of a ditransitive verb both receive the
Acc case.
(6.7) a. V-Sent: Verb Subject_Nom IObject_Acc DObject_Acc
AaTa Al+walad+u Al+bint+a kitAb+A
give+3MS the+boy+Nom the+girl+Acc book+Acc
'The boy gave the girl a book'
As in English, the ditransitive construction has an alternation where the indirect object appears
as object of the preposition l- 'to', with a Gen case:
However, if both direct and indirect objects are pronominal, the direct object appears as a
separate non-cliticizable direct pronoun after the subject.
Nominal Sentences
The prototypical Nominal Sentence (N-Sent) has the form of Subject-Predicate/Topic-Complement
(mubtada wa+xabar). This is sometimes referred to as a copular construction
or equational sentence.
Nominal Sentence Variants In the simplest N-Sent, the subject is typically a definite noun, proper
noun or pronoun in the Nom case and the predicate is an indefinite Nom noun, proper noun or
adjective that agrees with the subject in number and gender.
For the rest of this section, we will limit the glossing of morphological features to the minimum
needed. In addition to the basic nominal predicate form, the predicate can be a prepositional phrase
(PP):
The predicate can also be another N-Sent. In this construction, the subject of the top N-
Sent serves as a topic. The predicate of the top N-Sent will typically reference the topic using some
pronominal reference.
However, perhaps the most interesting predicate structure involves a V-Sent. Most com-
monly, this construction produces a Subject-Verb-Object look-alike order in Arabic when the sub-
ject of the embedded predicating V-Sent refers back to the subject of the main N-Sent. Here, the
subject and verb agree in full (gender, number and person) as opposed to agreeing in gender and
person as in a normal V-Sent. This construction is sometimes referred to as a complex sentence.
Contrast this following example with its base V-Sent variant.
As a result, there are three types of verbal constructions when it comes to how the subject is
expressed in Arabic: Verb-Subject, Subject-Verb and Verb+Subj.
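The agreement contrast between these three constructions can be summarized in a few lines of code. This is a simplified sketch of the rules as stated in this section, with hypothetical names (`verb_agreement`, the construction labels as strings); it covers only person, gender and number.

```python
def verb_agreement(subject, construction):
    """Return the verb's agreement features for a given subject.

    subject:      dict with 'person', 'gender' and 'number'
    construction: 'Verb-Subject', 'Subject-Verb' or 'Verb+Subj'
    """
    full = {key: subject[key] for key in ("person", "gender", "number")}
    if construction == "Verb-Subject":
        # V-Sent with a non-pronominal subject: number defaults to Sg.
        return {**full, "number": "Sg"}
    if construction in ("Subject-Verb", "Verb+Subj"):
        # Full agreement: N-Sent with a V-Sent predicate, or a
        # pronominal subject expressed on the verb itself.
        return full
    raise ValueError("unknown construction: " + construction)

boys = {"person": 3, "gender": "Masc", "number": "Pl"}
print(verb_agreement(boys, "Verb-Subject"))   # 3rd Masc Sg, as in katab+a
print(verb_agreement(boys, "Subject-Verb"))   # 3rd Masc Pl
```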
The subject of the main N-Sent can be also referred to by other arguments and adjuncts
inside the predicating V-Sent, such as the V-Sent object or object of one of its prepositions.
Note how in the above example, the top N-Sent subject, the topic, is in Nom case regardless of
its co-reference inside the V-Sent. Arabic allows a variant construction of the example above where
the verb object is topicalized, moved, without change in its case. In this construction, no pronominal
reference in the V-Sent is needed. This is not a common construction.
A final note: if the subject is indefinite, the order of subject and predicate is reversed. This
often happens with prepositional phrase predicates.
Adjectival Modification
Arabic adjectives follow the nouns they modify. Adjectives and nouns always agree in definiteness
and case. Adjectives of rational (Human) nouns agree in gender and number also. Broken plural
adjectives are form-wise singular and with ad hoc form-based gender, but they are functionally
plural (see Section 4.2.2). For example, the word Al+maharah+u is feminine and singular by
form but masculine and plural functionally.
e. AlkutAbu Almaharahu
the+authors+MP the+clever+MP
'the clever authors'
f. AlkutAbu AlmAhiruwna
the+authors+MP the+clever+MP
'the clever authors'
Adjectives of irrational (non-human) nouns agree with the nouns in gender and number
when the nouns are singular or dual, but adjectives of plural irrational nouns are oddly feminine singular.
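These adjectival agreement rules can be written down as a small function over functional features. The sketch below is an illustration only: `adjective_agreement` and the feature dictionary layout are hypothetical, and it ignores the form-versus-function distinction of broken plurals as well as the Idafa-specific definiteness rule discussed later in this section.

```python
def adjective_agreement(noun):
    """Return the features an adjective takes when modifying `noun`.

    noun: dict with 'gender', 'number', 'definite', 'case' and
          'rational' (True for human nouns).
    """
    adjective = {"definite": noun["definite"], "case": noun["case"]}
    if not noun["rational"] and noun["number"] == "Pl":
        # Plural irrational nouns take feminine singular adjectives.
        adjective["gender"] = "Fem"
        adjective["number"] = "Sg"
    else:
        adjective["gender"] = noun["gender"]
        adjective["number"] = noun["number"]
    return adjective

authors = {"gender": "Masc", "number": "Pl", "definite": True,
           "case": "Nom", "rational": True}
books = {"gender": "Masc", "number": "Pl", "definite": True,
         "case": "Nom", "rational": False}
print(adjective_agreement(authors))
print(adjective_agreement(books))
```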
Idafa Construction
The Idafa construction is a possessive/genitive construction relating two nouns: the first noun, the
possessor (muDAf), grammatically heads, and semantically possesses the second noun, the
possessed (muDAf Ailayhi). The possessor is in the construct state, and the possessed has
genitive case. This construction has many comparables in English: Noun1 Noun2 can translate
into Noun1 of Noun2, Noun2's Noun1 or a compound Noun2 Noun1.
(6.25) N-Phrase: NOUN1_construct NOUN2_Gen
mafAtiyHu Al+sayArahi
keys the+car
'the keys of the car' or 'the car's keys' or 'the car keys'
The two nouns together form a noun phrase which can be the second part of a different Idafa
construction. This can be extended recursively creating what is called an Idafa chain. All the words
in an Idafa chain except for the first word must be genitive. And all the words except for the last
word must be in construct state.
(6.26) N-Phrase: Idafa Chain
Aibonu Eami jAri rayiysi majlisi AidArahi Alarikahi
son uncle neighbor chief committee management the-company
'the cousin of the CEO's neighbor'
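The two constraints on an Idafa chain (genitive case on every word but the first, construct state on every word but the last) can be sketched as below. The function `mark_idafa_chain` is hypothetical; the case of the first word is passed in because it is set by the role of the whole phrase in the sentence, and the state of the last word is left unspecified because the chain itself does not determine it.

```python
def mark_idafa_chain(words, head_case):
    """Annotate each word in an Idafa chain with (word, case, state)."""
    last = len(words) - 1
    marked = []
    for i, word in enumerate(words):
        case = head_case if i == 0 else "Gen"
        state = "construct" if i < last else None  # not fixed by the chain
        marked.append((word, case, state))
    return marked

chain = ["Aibonu", "Eami", "jAri", "rayiysi", "majlisi", "AidArahi", "Alarikahi"]
for item in mark_idafa_chain(chain, "Nom"):
    print(item)
```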
Adjectives modifying the head of an Idafa construction agree with it in case, but they agree
with its dependent in definiteness:
(6.27) N-Phrase: NOUN1_construct NOUN2_Gen ADJ1
bAb+u Al+sayArahi Al+jadiyd+u
door the+car+Gen the+new+Nom
'the car's new door'
bAb+u sayArah jadiyd+u
door car+Gen new+Nom
'a car's new door'
102 6. ARABIC SYNTAX
In addition to basic possessive constructions, the Idafa construction is used in many linguistic
constructions in Arabic:
Quantification constructions such as kulu Alkutubi 'all [of] the books' and
xamsahu kutub 'five books'.
Preposition-like adverbial constructions such as qurba Albayti 'near the house'.
Adjectival Idafa, also known as false Idafa, such as Tawiylu
AlqAmahi 'tall of stature'.
Tamyiz Construction
The Tamyiz (tamyiyz or accusative of specification) construction relates two nouns. The first
noun, the specified (Almumayaz), heads and governs the second noun, the specifier
(Almumayiz), which qualifies the first noun. The specifier is always singular in number and accusative
in case. Tamyiz is used in a variety of linguistic constructions in Arabic:
comparatives and superlatives such as Akθaru bayADA [lit. more as to whiteness]
'whiter'.
measurement specification such as kylw laHmA [lit. a kilo as in meat] 'a kilo of meat',
or the common interrogative kam kitAbA? [lit. how many as in book?] 'how many
books?'
some number constructions such as xamsuwna kitAbA [lit. fifty as of book] 'fifty
books'.
type specification such as xAtimu fiDah [lit. a ring as in silver] 'a silver ring'.
Apposition
An apposition construction (badal) relates two noun phrases that refer to the same entity.
The heads of the two noun phrases agree in case, e.g., Alryiysu
Almriykiyu, bArAk AwbAmA 'the American President, Barack Obama'. A very common appositional
construction in Arabic involves the demonstrative pronoun, which typically precedes the noun
it modifies although it can also follow: hA AlkitAb [lit. this the-book] 'this book'.
Relative Clauses
Relative clauses modify the noun that heads them. If the heading noun is definite, the relative clause
(Sila/linking sentence) is introduced and headed with a relative pronoun (RELPRO).
When present, the relative pronoun agrees with the noun it modifies in gender and number
following Adjectival agreement rules (irrationality gets exceptional agreement).
(6.28) N-Phrase: NOUN_definite RELPRO SENTENCE
AlkitAbu Aliy [ uHibu -hu ]
the-book which [ I+love -it ]
'the book [which] I love'
If the heading noun is indefinite, the relative clause (called Sifa/adjectival sentence
in this case) is not introduced with a relative pronoun.
(6.29) N-Phrase: NOUN_indefinite SENTENCE
Nominal Arguments
Verbal nouns in Arabic such as deverbal nouns (maSdar) and active participles
behave like verbs in that they can take an accusative object argument and other verbal modifiers.
Their nominal form allows them to additionally participate in some of the nominal constructions
discussed earlier, such as Idafa.
(6.31) N-Phrase: MASDAR_construct NOUN1_Gen NP-OBJ_Acc
marifahu Alrajuli AlHaqiyqaha
knowing+Nom the+man+Gen the+truth+Acc
'the man's knowledge of the truth'
104 6. ARABIC SYNTAX
6.1.4 PREPOSITIONAL PHRASES
Arabic prepositional phrases consist of a preposition followed by a noun phrase. The head of the
noun phrase is in the genitive case.
Figure 6.1: The phrase structure representation in the Penn Arabic Treebank (PATB) for the sentence
xmswn Alf sAyH zArwA lbnAn wswryA fy Aylwl AlmADy '50 thousand tourists visited Lebanon and
Syria last September.'
[Figure: PATB phrase structure tree for the sentence above.]
[Figure 6.2: dependency tree for the same sentence in the PADT representation, with nodes labeled Sb, Coord, AuxP, Atr, Obj_Co and Adv.]
word is also modified by a noun in an attributive relation. The PADT does not distinguish between
Idafa, Tamyiz and adjectival modification; they are all called Atr. The second child of the verb heads
two proper nouns with the composite relation Obj_Co, which indicates at once that the two proper
nouns are coordinated (Co) by their parent and that they both are objects (Obj) of their grandparent
verb. The last child of the verb, the preposition fy 'in', heads a proper noun Aylwl 'September'
with the relation Adv (adverbial), which heads an adjective in an attribute (Atr) relation. The relation
Adv indicates how the month name modifies the main verb despite the presence of the preposition
in between the two. This highlights an important aspect of the analytical syntactic representation in
the PADT, namely that it is deeper and more semantically (specifically propositionally) aware than
other treebanks. For more information on PADT, see [144, 145, 143, 67, 138].
The initial version of PADT [145] contained around one hundred thousand words. PADT
was used in the CoNLL 2006 and CoNLL 2007 shared tasks on dependency parsing [146] and
its morphological data has been used for training automatic taggers [114]. The current version of
PADT (2.0) contains over one million tokens of PATB-converted trees and trees annotated for
PADT directly.
6.2.3 COLUMBIA ARABIC TREEBANK
The Columbia Arabic Treebank (CATiB) project started at Columbia University in 2008. It contrasts
with previous Arabic treebanking approaches in putting an emphasis on faster production with
some constraints on linguistic richness [112, 121]. Two ideas inspire the CATiB approach. First,
CATiB avoids annotation of redundant linguistic information. For example, nominal case and state
in Arabic are determined automatically from syntax and morphological analysis of the words and
need not be annotated by humans. Of course, some information in CATiB is not easily recoverable,
such as phrasal co-indexation and full lemma disambiguation. Second, CATiB uses an intuitive
dependency structure representation and relational labels inspired by traditional Arabic grammar
such as Tamyiz and Idafa in addition to the well-recognized labels of subject, object and modifier.
This makes it easier to train annotators, who need not have degrees in linguistics.
There are eight syntactic relations used to label the dependency attachments in CATiB: subject
(SBJ), object (OBJ), predicate (PRD), topic (TPC), Idafa (IDF), Tamyiz (TMZ), modifier (MOD)
and flat (---). SBJ marks the explicit syntactic subjects of verbs (active or passive), regardless of
whether they appear before or after the verb, as well as subjects of nominal sentences. TPC is restricted
to the subject/topic of a complex nominal sentence whose complement is a verb with a
different subject. Typically, there is an object pronoun that refers back to the topic. The use of SBJ
and TPC is different in CATiB from PATB. MOD is the most common relation used to mark all
modifications such as adjectival modifications of nouns, adverbial modification and prepositional
phrase modification of nouns and verbs. The flat relation marks multi-word structures that cannot
be explained using any of the above relations. The most common case is the different parts of a
proper name, e.g., a last name is in a flat relation to a first name.
CATiB includes almost 1 million tokens: 270K tokens of annotated newswire text in addition
to converted PATB trees (parts 1, 2 and 3). Since the PATB has more information, conversion to
CATiB is feasible at a good degree of correctness [112]. All CATiB annotated sentences are taken
from a parallel Arabic-English corpus, so the sentences have translations associated with them.
Figure 6.3 presents an example of a sentence in CATiB. For the actual format of the CATiB
trees, see Figure 6.4. To some degree, the dependency representation is similar to that used in PADT
but with some very important differences (which we discuss in the next section). The head of the
sentence is the verb zArwA 'visited'. It has three children: a subject (SBJ), an object (OBJ) and
a prepositional modifier (MOD). The subject is a complex number expression containing
Idafa and Tamyiz relations. The object heads a coordinating conjunction particle, which heads a
coordinated conjunct. The third verb child, the preposition, governs an object (OBJ), which itself is
modified by an adjectival nominal. This simplicity and coarse-grained nature of the relations used
is the distinguishing mark of CATiB annotation compared to the other treebanking approaches.
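As a rough illustration of the structure just described, the sketch below encodes the example sentence in a five-column layout (word index, word, POS, parent index, relation) and reads off the children of the root verb. The rows are a hand-built approximation, not the released CATiB file: the POS tags other than VRB, the labels on the conjunction and its conjunct, the `---` label on the root, and all function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical five-column rows: index, word, POS, parent index, relation.
# Only VRB and the SBJ/OBJ/MOD/IDF/TMZ relations are named in the text;
# the remaining tags and the root label "---" are assumptions.
CATIB_ROWS = """\
1 xmswn NOM 4 SBJ
2 Alf NOM 1 TMZ
3 sAyH NOM 2 IDF
4 zArwA VRB 0 ---
5 lbnAn PROP 4 OBJ
6 w+ PRT 5 MOD
7 swryA PROP 6 OBJ
8 fy PRT 4 MOD
9 Aylwl PROP 8 OBJ
10 AlmADy NOM 9 MOD
"""


def parse_rows(text):
    """Turn whitespace-separated five-column rows into token dictionaries."""
    tokens = []
    for line in text.strip().splitlines():
        idx, word, pos, parent, rel = line.split()
        tokens.append({"id": int(idx), "word": word, "pos": pos,
                       "parent": int(parent), "rel": rel})
    return tokens


def children_by_parent(tokens):
    """Group tokens under the index of their parent (0 is the artificial root)."""
    kids = defaultdict(list)
    for tok in tokens:
        kids[tok["parent"]].append(tok)
    return kids


tokens = parse_rows(CATIB_ROWS)
kids = children_by_parent(tokens)
root = kids[0][0]
root_children = [(t["word"], t["rel"]) for t in kids[root["id"]]]
print(root["word"], root_children)
```

The verb ends up with exactly the three children named in the text (subject, object and prepositional modifier), while the number expression and the coordination hang below the subject and the object.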
[Figure 6.3: CATiB dependency tree for the example sentence, rooted at the verb zArwA 'visited' (VRB).]
ture (PS) and both CATiB and PADT use dependency structure (DS). See Figures 6.1, 6.2 and 6.3.
PS is a tree representation in which words in a sentence appear as leaves and internal nodes are
syntactic categories such as noun phrase (NP) or verb phrase (VP). DS is also a tree except that the
words in the sentence are the nodes on the tree [147]. In terms of linguistic content, we can further
distinguish the following categories of content.
Syntactic Structure PADT and CATiB annotate heads explicitly and spans of phrases/clauses
implicitly, whereas PATB annotates spans explicitly and heads implicitly. PATB uses intermediate
projections, such as VP, to represent certain syntactic facts. The DS treebanks, PADT and CATiB, use
other devices, such as attachment labels, to represent the same facts. PADT and CATiB approach
some structures differently. For example, in PADT, the coordination conjunction heads over the
different elements it coordinates as opposed to the way it is done in CATiB. See how
lbnAn w+ swryA 'Lebanon and Syria' is represented in PADT and CATiB in Figures 6.2 and 6.3.
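The idea that a dependency treebank stores heads explicitly and spans only implicitly can be made concrete with a small sketch. Given nothing but a parent index per word, the span covered by each head can be recovered by walking up from every word to the root. The parent list below is an assumed CATiB-style analysis of the ten-word example sentence, and the function name is hypothetical; the resulting ranges correspond to contiguous phrases only when the tree is projective, as it is here.

```python
# Assumed parent indices (1-based, 0 = root) for:
# xmswn Alf sAyH zArwA lbnAn w+ swryA fy Aylwl AlmADy
PARENTS = [4, 1, 2, 0, 4, 5, 6, 4, 8, 9]


def subtree_spans(parents):
    """Return {word index: (first, last)} for the subtree under each word."""
    n = len(parents)
    spans = {i: [i, i] for i in range(1, n + 1)}
    for i in range(1, n + 1):
        node = parents[i - 1]
        # Widen the span of every ancestor so that it covers word i.
        while node != 0:
            spans[node][0] = min(spans[node][0], i)
            spans[node][1] = max(spans[node][1], i)
            node = parents[node - 1]
    return {i: tuple(s) for i, s in spans.items()}


spans = subtree_spans(PARENTS)
for head, (first, last) in spans.items():
    print(head, first, last)
```

Under these assumptions the verb (word 4) covers the whole sentence, the first number word covers the three-word subject, and the preposition covers the final three words, which are the same units a phrase structure tree would bracket explicitly.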
Syntactic and Semantic Functions PATB uses about 20 dashtags that are used for marking syntactic
and semantic functions. Syntactic dashtags include -TPC and -OBJ and semantic tags include
-TMP (time) and -LOC (location). Some dashtags serve a dual semantic/syntactic purpose such
Figure 6.4: The internal representation of a syntactic tree in phrase structure (specifically PATB release
format) and dependency structure (specifically CATiB release format). These examples are paired with
the examples in Figures 6.1 and 6.3. The PATB trees are typically printed on a single line. The CATiB
trees are represented in five columns indicating word index, word, POS, parent word index and relation. The
Arabic words are represented in the Buckwalter transliteration scheme [88]. All glosses are additional.
Penn Arabic Treebank Example
as -SBJ, which can mark the syntactic subject of a verb and the semantic subject of a deverbal noun.
PATB does not explicitly annotate dashtags in some cases such as objects of prepositions or the
Idafa/Tamyiz constructions. These are implicitly marked through the syntactic structure. Idafa and
Tamyiz are identical in PATB except for the morphological case information, which can be used to
distinguish them. CATiB's relation labels mark syntactic function only. The use of the syntactic labels
SBJ and TPC is different between CATiB and PATB. In PATB, TPC is used to mark the subject
or object when they appear before the verb. Further co-indexation is used to specify the role of the
TPC inside the verb phrase. See how the subject is handled in Figures 6.1 and 6.3. The subject of a
verbless (non-complex) nominal sentence is marked as SBJ in both PATB and CATiB. PADT uses
around 20 labels, although with different functionality from PATB and CATiB. In general, PADT
analytical labels are deeper than CATiB since they are intended to be a stepping stone towards the
6.2. ARABIC TREEBANKS 111
PADT tectogrammatical level. For instance, dependents of prepositions are marked with the relation
they have to the node governing the preposition (the grandparent node). For example, in Figure 6.2,
Aylwl 'September' is marked Adv (Adverbial) of the main verb zArwA 'visited'. Similarly,
the coordinated elements lbnAn w+ swryA 'Lebanon and Syria' are marked as both
Co (coordinated) and with their relationship to the governing verb, Obj (object). PADT does not
distinguish different types of nominal modifiers, i.e., adjectives, Idafa and Tamyiz (in numbers) are
all marked as Atr (Attribute).
Empty Pronouns Empty pronouns are annotated in PATB but not in PADT or CATiB. Verbs with
no explicit subjects in CATiB (and PADT) can be assumed to pro-drop (implicit annotation).
Coreference Coreference indices are annotated in PATB for traces and explicit pronouns. PADT
only annotates coreference between explicit pronouns and what they corefer with. CATiB does not
annotate any coreference indices.
Word Morphology CATiB uses the same basic tokenization scheme used by PATB and PADT. As
for parts-of-speech, PATB uses over 400 tags specifying every aspect of Arabic word morphology
such as definiteness, gender, number, person, mood, voice and case. PADT morphology is more
complex than PATB. For instance, it makes more sophisticated distinctions on nominal and adjectival
definiteness/state, number, and gender. In contrast, CATiB uses six POS tags only. It is important to
point out that in most Arabic parsing work, a much smaller POS tag set is used, reducing the 400 or
so tags in PATB to a set between 20 and 40 tags [119]. [110] reports on a simple regular-expression-
based extension to CATiB's tag set that produces competitive results. Some of the rich morphology
information not included in reduced POS tag sets, such as nominal case, can also be retrieved from
the tree structure because they are defined syntactically [79].
Despite the many differences, conversion between these different representations can be done
with a good degree of success given that the information is available in the tree although represented
differently. Since CATiB has less content than PATB and PADT, it is perhaps much easier to convert
from these two representations into CATiB's than the other way around.
The team behind the PATB has been extending its profile to include additional genres such
as Arabic used in broadcast news and conversation, telephone conversations and blogs. One
example of this is the Levantine Arabic treebank [140, 148].
The Quran Corpus project at the University of Leeds includes a treebanking effort targeting
the Quran (QuranTree). The representation used in this project is a hybrid of phrase and
dependency structures and is the closest to descriptions of traditional Arabic grammar [10,
149].
A version of the PATB at Dublin University uses a lexical functional grammar (LFG) represen-
tation. This treebank was automatically converted and does not include additional annotations
[150].
The Arabic Propbank (Propositional Bank) [151] and the OntoNotes project [152] annotate
for Arabic semantic information. We discuss the Arabic Propbank further in the next Chapter.
CHAPTER 7
xaluwj 'a female camel whose baby died'. This situation is comparable to the numerous words
for horse in the English jargon of horse breeders, e.g., foal 'a baby horse still at its mother's side' or
gelding 'a castrated male horse'. In that respect, Arabic is no different again from other languages.
Figure 7.1: The various framesets associated with the Arabic verb qAm with examples.
The APB also defines 24 argument types, which include five primary numbered arguments
(ARG0, ARG1, ARG2, ARG3, ARG4) and 19 adjunctive arguments, which include ARGM-TMP
5 http://verbs.colorado.edu/propbank/framesets-arabic/
7.3. ARABIC WORDNET 115
(temporal adjunct) and ARGM-NEG (negation adjunct). The use of numbered arguments allows
a propbank to capture generalizations about framesets of a particular verb without having to select
from a restricted set of named thematic/semantic roles. An example of a fully annotated tree is
presented in Figure 7.2.
Figure 7.2: The Arabic Propbank annotation of the Penn Arabic Treebank (PATB) for the sentence
xmswn Alf sAyH zArwA lbnAn wswryA fy Aylwl AlmADy '50 thousand tourists visited Lebanon and
Syria last September.' The main verb predicate, zAr 'visit', has only one frameset with two arguments:
ARG0 (entity visiting) and ARG1 (entity visited). The NP-TPC in this example is indirectly assigned
the ARG0 label through its common index with NP-SBJ.
[Figure: PATB tree with Propbank labels: VERB = PRED, NP-SBJ = ARG0, NP-OBJ = ARG1, PP-TMP = ARGM-TMP, NP-TPC = ARG0.]
The Arabic Propbank has already been used by researchers in the task of Semantic Role
Labeling (SRL) [165, 166].
Figure 7.3: Paired synsets from Arabic WordNet and English WordNet.
Figure 7.4: An example of a sentence in Arabic and English with named entity recognition tags in XML.
zAr <PER>Almlk Hsyn</PER> <GPE>lbnAn</GPE> fy AlAm AlmADy.
<PER>King Hussein</PER> visited <GPE>Lebanon</GPE> last year.
[14] constructed a corpus with annotations for naturally occurring numerical expressions and
used it to evaluate a system for automatic detection of these expressions.
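The XML-style named entity tagging shown in Figure 7.4 can be read back with a few lines of code. The sketch below is only an illustration under simple assumptions: tags are uppercase, never nested, and always properly closed, as in the two example sentences. The function names are hypothetical.

```python
import re

# Matches <TAG>text</TAG> where TAG is an uppercase label such as PER or GPE.
ENTITY = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def extract_entities(text):
    """Return (label, surface string) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(text)]


def strip_tags(text):
    """Remove the entity markup and keep the plain sentence."""
    return ENTITY.sub(lambda m: m.group(2), text)


english = "<PER>King Hussein</PER> visited <GPE>Lebanon</GPE> last year."
arabic = "zAr <PER>Almlk Hsyn</PER> <GPE>lbnAn</GPE> fy AlAm AlmADy."

english_entities = extract_entities(english)
arabic_entities = extract_entities(arabic)
plain_english = strip_tags(english)
print(english_entities, arabic_entities, plain_english)
```

Because both sentences carry the same labels in the same order, the two entity lists line up one to one, which is what makes such paired annotations convenient for name translation experiments.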
CHAPTER 8
and nomenclature of, the SMT and RBMT camps can be deceptive since explicit linguistic rules
can be probabilistic and can be learned automatically. The last few years have witnessed an increased
interest in hybridizing the two approaches to create systems that exploit the advantages of both
linguistic rules and statistical techniques. The most successful of such attempts so far are solutions
that build on statistical corpus-based approaches by strategically using linguistics constraints or
features.
8.2.1 ORTHOGRAPHY
In terms of orthography, Arabic's reduced alphabet with optional diacritics and common cliticizations
falls in between Spanish and English (both alphabets) on one hand and Chinese (complex system
with around 10,000 logographic characters) on the other. Arabic tokenization is far easier than
Chinese segmentation. But the two languages start to pose similar challenges when translating from
OCRed text. Arabic diacritic absence adds to the ambiguity of translating from Arabic, in general,
but it is especially problematic for proper name transliteration [21, 45, 50, 22]. The good news is
that when translating into Arabic, as opposed to from Arabic, the absent diacritics in the output may
render some translation errors irrelevant.
1 The four languages we discuss here are all resource-rich high-density languages. It is important to point out that Arabic dialects,
which are not part of this book, are technically resource-poor or low-density languages. The issue of resource density will not be
discussed here.
8.2. A MULTILINGUAL COMPARISON 121
Figure 8.2: A comparison of Arabic, Spanish, English and Chinese across six linguistic aspects. Table
legend: V=Verb, Subj=Subject, VSubj=Pro-dropped Verb, N=Noun, Adj=Adjective, Poss=Possessor,
Rel=Relative Clause.
                     Arabic               Spanish      English      Chinese
Orthography          optionally-reduced   alphabet     alphabet     logographic
                     alphabet                                       characters
Morphology           very rich            rich         poor         very poor
Subject-Verb order   V Subj               VSubj        Subj V       Subj V
                     VSubj                Subj V
                     Subj V
Adjectival Modifier  N Adj                N Adj        Adj N        Adj N
Possessive Modifier  N Poss               N de Poss    N of Poss    Poss N
                                                       Poss 's N
                                                       Poss N
Relative Modifier    N Rel                N Rel        N Rel        Rel N
8.2.2 MORPHOLOGY
Arabic stands as the most morphologically complex language compared. Arabic is followed by
Spanish, then English and finally Chinese, which is an isolating language with no morphology to
talk of. Arabic morphological complexity leads to a large number of possible word forms, which
results in the computational problems of increased sparsity and a high degree of Out-of-Vocabulary
(OOV) terms. In a study by [50], almost 60% of OOV words in an Arabic to English MT system
were found to involve verbs, nouns and adjectives, many of which are unseen morphological variants
of infrequently seen words.
Arabic morphological complexity and its consequences are typically handled through automatic
tokenization to break up words into smaller units with less sparsity. The question of what is an
optimal tokenization has been explored by various researchers mostly working on Arabic-English
MT. Lee [108] investigated the use of automatic alignment of POS-tagged English and affix-stem
segmented Arabic to determine appropriate tokenizations of Arabic. [83, 105] conducted a large set
of experiments including multiple preprocessing schemes reflecting different levels of morphological
representation and multiple techniques for disambiguation/tokenization. Other results were reported
using specific preprocessing schemes and techniques by [181, 182, 183, 98]. Improvements for word
alignment were also shown using different morphological tokenizations [107]. In principle, different
optimal tokenizations can be used for different parts of an MT system so long as they are coordinated.
122 8. A NOTE ON ARABIC AND MACHINE TRANSLATION
For example, lemmas can be used for automatic alignment, but some inflected decliticized form can
be used in the translation model. Various tokenization schemes are discussed in Section 5.3.
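The link between tokenization and sparsity can be shown on a toy scale. The sketch below measures the OOV rate of a tiny test set before and after splitting off the conjunction proclitic w+ 'and', as in wswryA versus w+ swryA. Everything here is an assumption made for illustration: the word lists are invented, the form wlbnAn ('and Lebanon') is a constructed example, and the splitter consults a hand-written stem list rather than a real morphological analyzer.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def split_conjunction(word, stems):
    """Split a leading w only when the remainder is a known stem (toy rule)."""
    if word.startswith("w") and word[1:] in stems:
        return ["w+", word[1:]]
    return [word]


def tokenize(words, stems):
    """Apply the toy proclitic splitter to every word."""
    return [tok for word in words for tok in split_conjunction(word, stems)]


STEMS = {"lbnAn", "swryA"}                      # hypothetical stem list
train = ["zArwA", "lbnAn", "wlbnAn", "swryA"]   # invented training words
test = ["zArwA", "lbnAn", "wswryA"]             # invented test words

raw_oov = oov_rate(train, test)
tok_oov = oov_rate(tokenize(train, STEMS), tokenize(test, STEMS))
print(raw_oov, tok_oov)
```

On the raw words, wswryA is unseen even though both of its parts occur in training; once the clitic is separated, every test token is covered. Real systems face the harder problem that a leading w is often part of the stem, which is why disambiguation is needed before tokenization.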
Translation into Arabic from other languages faces an added problem: the output needs
to be in a morphologically complex form even if some simplified form is used in the translation
models or dictionaries. Arabic detokenization or recombination has been demonstrated successfully
by [99, 13, 106].
8.2.3 SYNTAX
Figure 8.3: A pair of word-aligned Arabic and English sentences. The Arabic syntactic representation,
provided for illustrative purposes, is in CATiB style annotation.
Arabic is a morphosyntactically complex language with many differences from Spanish, En-
glish and Chinese. We describe here four syntactic phenomena: subject-verb order, adjectival mod-
ification, possessive modification, and relative modification. Figure 8.3 illustrates some of these
phenomena in an Arabic to English context.
Arabic verb subjects may be: (a.) pro-dropped (verb conjugated), (b.) pre-verbal, or (c.) post-
verbal. Each situation comes with its own morphosyntactic restrictions. Spanish also allows pro-drop
in similar contexts to Arabic, but unlike Arabic, Spanish does not have an option for a Verb-Subject
order. English and Chinese are both generally Subject-Verb languages. Given the three possibilities
for where the subject can go, when translating from Arabic, the challenge is to determine whether
there is an explicit subject and, if so, whether it is pre- or post-verbal. Since Arabic objects also
8.3. STATE OF THE FIELD OF ARABIC MT 123
Figure 8.4: An example of long distance reordering of Arabic VSO order into English SVO order
[V A ln] [NP-SBJ Almnsq AlAm lmrw Alskh AlHdyd byn dwl mjls AltAwn Alxlyjy] [SUB An ...]
[NP-SBJ The general coordinator of the railroad project among the countries of the Gulf Cooperation Council]
[V announced] [SUB that ...]
follow the verb, a sequence of Verb and a noun phrase may be a Verb-Subject or a pro-dropped Verb-
Object. The problem is exacerbated with very long subjects that can themselves be split mistakenly
into smaller noun phrases. This is a challenge to both SMT systems (with possible limited phrase
window size) and RBMT systems, which may make syntactic parsing errors. See Figure 8.4 for a
ten-word subject example.
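Once the verb and the full subject have been identified, the reordering itself is mechanical; the difficulty described above lies entirely in finding the chunk boundaries. The sketch below assumes that chunking has already been done correctly and simply moves a verb behind the subject chunk that immediately follows it. The chunk labels follow Figure 8.4, while the function name and the list-of-pairs data layout are assumptions for this example.

```python
def vso_to_svo(chunks):
    """Swap the first verb chunk with an immediately following subject chunk."""
    out = list(chunks)
    for i in range(len(out) - 1):
        if out[i][0] == "V" and out[i + 1][0] == "NP-SBJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            break
    return out


# English glosses of the chunks in Figure 8.4, kept in Arabic (VSO) order.
vso = [
    ("V", "announced"),
    ("NP-SBJ", "the general coordinator of the railroad project among "
               "the countries of the Gulf Cooperation Council"),
    ("SUB", "that ..."),
]

svo = vso_to_svo(vso)
print([label for label, _ in svo])
```

If the long subject were mistakenly split into several smaller noun phrases, the same swap would move the verb into the middle of the subject, which is the kind of error the text warns about.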
Translating from any of the other languages to Arabic may in principle be easier since maintaining
the original word order is acceptable in Arabic. This may be true syntactically, but it will have
some consequences on perceived fluency and textual flow in Arabic.
Arabic and Spanish nominal modifiers of all types (adjectival, possessive and relative) follow
the noun they modify. Chinese is consistent also, but in the opposite order. Chinese uses the function
particle de for marking all modification structures. Spanish also uses function words: a preposition
(coincidentally also de) for marking possessive structures and relative pronouns, e.g., que, for relative
modification. Arabic, however, depends more on subtle coordination of definite articles to distin-
guish adjectival and possessive (Idafa) modification. As for relative structures, indefinite relative
modification in Arabic forbids the presence of a relative pronoun, which leads to structural ambiguity
comparable to the English: the man wanted (by Mary)/(to go). While English's Verb-Subject order
is simple, English nominal modification phenomena are all over the place. In some cases, English is
closer to Arabic or Spanish and in others it is closer to Chinese. In particular, English has a lot of
variety in its possessive construction. For example, the English phrases the car keys, the car's keys and
the keys of the car all translate into the Arabic mfAtyH AlsyArh [lit. keys the-car]. In
contrast, Arabic has a lot of variety in Verb-Subject order, but not in nominal modification order.
Much work is going on in terms of syntactic modeling for MT, in general, and for Arabic-
English [184, 130, 185, 186, 156, 187] and English-Arabic [188, 189], in particular.
APPENDIX A
Elsnet's List of pointers to Arabic and other Semitic NLP and Speech sites
MEDAR: 2009 Arabic HLT Survey, 2005 Arabic HLT Survey, BLARK and Archive
Linguistlist on Arabic
NLP-4-Arabic webpage
Linguist List
Georgetown University Round Table on Arabic Language and Linguistics (GURT 2010)
International Symposium on Computer and Arabic Language (ISCAL 2009, ISCAL 2007)
Workshop on HLT & NLP within the Arab World (LREC 2008)
NLP track in the International Conference on Informatics and Systems (INFOS 2010,
INFOS 2008)
A.2. NETWORKING AND CONFERENCES 127
Colloque International sur le Traitement Automatique de la Langue Arabe (CITALA) (Rabat,
2007)
The Challenge of Arabic for NLP/MT Conference (British Computer Society 2006)
Novel Approaches to Arabic Speech Recognition (Johns Hopkins University summer
workshop 2002)
APPENDIX B
Bateson, Mary Catherine. 2003. Arabic Language Handbook. Georgetown University Press.
Brustad, Kristen E. 2000. The Syntax of Spoken Arabic: A Comparative Study of Moroccan,
Egyptian, Syrian, and Kuwaiti Dialects. Georgetown University Press.
Fischer, W. 2001. A Grammar of Classical Arabic. Yale Language Series. Yale University
Press. Translated by Jonathan Rodgers.
Holes, Clive. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown Uni-
versity Press.
Ryding, Karin C. 2006. A Reference Grammar of Modern Standard Arabic. Cambridge Uni-
versity Press.
Schulz, Eckehard. 2008. A Student Grammar of Modern Standard Arabic. Cambridge Uni-
versity Press.
Buckley, Ron. 2004. Modern Literary Arabic: A Reference Grammar. Librairie du Liban.
Wright, W. 1896. A Grammar of the Arabic Language. Cambridge University Press. (a classic
grammar book)
Bohas, G., J. Guillaume and D. Kouloughli. 2006. The Arabic Linguistic Tradition. George-
town University Press.
Wehr, Hans. 1979. Dictionary of Modern Written Arabic (Arabic-English). Ithaca: Spoken
Language Services.
Hinds, Martin and El-Said Badawi. 1986. A Dictionary of Egyptian Arabic. Beirut: Librarie
du Liban.
Stowasser, Karl and Moukhtar Ani. 2004. A Dictionary of Syrian Arabic. Georgetown Uni-
versity Press.
Clarity, B. E., Karl Stowasser, Ronald G. Wolfe, D. R. Woodhead, and Wayne Beene. 2003.
A Dictionary of Iraqi Arabic. Georgetown University Press.
Harrell, Richard S. and Harvey Sobelman. 2004. A Dictionary of Moroccan Arabic. George-
town University Press.
Qafisheh, Hamdi. 1999. NTC's Gulf Arabic - English Dictionary. NTC Publishing Group.
Qafisheh, Hamdi. 1999. NTC's Yemeni Arabic - English Dictionary. NTC Publishing
Group.
Van den Bosch, A. and A. Soudi. 2007. Arabic Computational Morphology: Knowledge-based
and Empirical Methods. Springer.
Farghaly, Ali. 2010. Arabic Computational Linguistics. The University of Chicago Press.
Farghaly, Ali and Khaled Shaalan. 2009. Arabic Natural Language Processing: Challenges
and Solutions. A Special Issue of the ACM Transactions on Asian Language Information
Processing (TALIP).
Wintner, Shuly. 2009. Language Resources for Semitic Languages: Challenges and Solutions.
In Sergei Nirenburg (ed.) Language Engineering for Lesser-Studied Languages. Amsterdam:
IOS Press.
B.4. TUTORIALS AND LECTURES 131
United Nations Report. 2003. Harmonization of ICT standards related to Arabic Language
use in information society applications. (A very informative report on information and com-
munication technology in the Arab World).
APPENDIX C
Levantine Arabic QT Training Data Set 5 (Audio) (Transcripts) A combination of four training
data sets totaling 250 hours of telephone conversation in Levantine Arabic
OrienTel is a European project focusing on the development of language resources for speech-
based applications (Website):
The 2010 NIST Open Handwriting Recognition and Translation Evaluation (OpenHaRT
2010)
ArabiCorpus
Official Document System of the United Nations (English, French, Spanish, Arabic, Russian,
Chinese)
Arabic Wikipedia with many terms paired with other languages (not strictly parallel)
NEMLAR Written Corpus: raw text, fully vowelized text, text with Arabic lexical analysis,
text with Arabic POS-tags ELRA Catalog W0042
C.3.5 TREEBANKS
Penn Arabic Treebank (LDC) Part 1 v 3.0 Part 2 v 2.0 Part 3 v 2.0
Prague Arabic Dependency Treebank 1.0: (through LDC) (114K tokens) PADT 2.0.
CATiB: Columbia Arabic Treebank 1.0 (Website) (available through LDC - LDC2009E06
- by request ldc@ldc.upenn.edu)
OntoNotes Release 3.0 (English, Arabic and Chinese texts annotated for syntax, predicate
argument structure, word sense and coreference). (BBN's webpage)
Bilingual Dictionary French Arabic, Arabic French (DixAF) ELRA Catalog M0040
Root list inside the morphological analyzer Sebawai (Contact Dr. Kareem Darwish)
C.5.6 GAZETTEERS
FAOTERM: United Nations Food and Agriculture Organization of the Terminology refer-
ence for country names (six languages including Arabic)
Geonames.de's multilingual resource for names of geographical entities (and other things)
C.5. LEXICAL DATABASES 139
U.S. Board on Geographic Names (including Arab countries) uses SATTS Arabic translit-
eration
APPENDIX D
MAGEAD: Morphological Analysis and Generation for Arabic and its Dialects
AMIRA: Toolkit for Arabic tokenization, POS tagging and base phrase chunking
MADA: Morphological Analysis and Disambiguation for Arabic a tool for tokenization, lemma-
tization, diacritization and POS tagging
142 ARABIC NLP TOOLS
D.4 PARSERS
The Stanford Parser
The Bikel Parser
MALTParser
Mohammed Attia's Rule-based Parser for MSA
D.5 TYPESETTING
ArabTeX (LaTeX support for Arabic)
D.8 LEXICOGRAPHY
aConCorde: A concordance generation program for Arabic
APPENDIX E
BLARK: Basic Language Resource Kit a minimal set of language resources necessary to do
research.
NW: Newswire
STT: Speech-to-Text
TTS: Text-to-Speech
WB: Weblogs
Bibliography
[1] El-Said M. Badawi. Mustawayat al-Arabiyya al-muasira fi Misr (The Levels of Modern Arabic
in Egypt). Cairo: Dar al-Maarif, 1973. 1, 2
[2] Reem Bassiouney. Arabic Sociolinguistics: Topics in Diglossia, Gender, Identity, and Politics.
Georgetown University Press, 2009. 2
[4] Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. On Arabic Transliteration. In
A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based
and Empirical Methods. Springer, 2007. 4, 21, 27, 31
[5] Tim Buckwalter. Issues in Arabic Morphological Analysis. In A. van den Bosch and A. Soudi,
editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer,
2007. 7, 13, 33
[6] Kam-Fai Wong, Wenji Li, Ruifeng Xu, and Zheng sheng Zhang. Introduction to Chinese
Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan
and Claypool, 2010. 8
[7] Elsaid Badawi, Mike G. Carter, and Adrian Gully. Modern Written Arabic: A Comprehensive
Grammar. Routledge, 2004. 10, 63, 93
[8] Nizar Habash and Owen Rambow. Morphophonemic and Orthographic Rules in a Multi-
Dialectal Morphological Analyzer and Generator for Arabic Verbs. In International Sympo-
sium on Computer and Arabic Language (ISCAL), Riyadh, Saudi Arabia, 2007. 10, 32, 59
[9] Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. The Penn Arabic
Treebank: Building a Large-Scale Annotated Arabic Corpus, 2004. 11, 53, 79, 93, 104, 105
[10] Kais Dukes and Nizar Habash. Morphological Annotation of Quranic Arabic. In Proceedings
of the Language Resources and Evaluation Conference (LREC), Malta, 2010. 12, 112
[11] Rani Nelken and Stuart Shieber. Arabic Diacritization Using Weighted Finite-State Trans-
ducers. In Proceedings of the Workshop on Computational Approaches to Semitic Languages at 43rd
Meeting of the Association for Computational Linguistics (ACL'05), pages 79-86, Ann Arbor,
Michigan, 2005. DOI: 10.3115/1621787.1621802 13, 24
148 BIBLIOGRAPHY
[12] Imed Zitouni, Jeffrey S. Sorensen, and Ruhi Sarikaya. Maximum Entropy Based Restoration
of Arabic Diacritics. In Proceedings of the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages
577-584, Sydney, Australia, 2006. DOI: 10.3115/1220175.1220248 13, 24
[13] Ahmed El Kholy and Nizar Habash. Techniques for Arabic Morphological Detokenization
and Orthographic Denormalization. In Workshop on Language Resources and Human Language
Technology for Semitic Languages in the Language Resources and Evaluation Conference (LREC),
Valletta, Malta, 2010. 13, 36, 69, 77, 78, 122, 124
[14] Nizar Habash and Ryan Roth. Identification of Naturally Occurring Numerical Expres-
sions in Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC),
Marrakech, Morocco, 2008. 14, 118
[15] Alan Kaye. Adaptations of Arabic Script. In P.T. Daniels and W. Bright, editors, The World's
Writing Systems. Oxford University Press, 1996. 14
[16] Klaus Lagally. ArabTeX: Typesetting Arabic and Hebrew, User Manual Version 4.00. Technical
Report 2004/03, Fakultät Informatik, Universität Stuttgart, March 11 2004. 16, 76
[17] Nizar Habash. Nuun: A System for Developing Platform and Browser Independent Ara-
bic Web Applications. In Proceedings of the Arabic Translation and Localization Conference
(ATLAS-99), Tunis, Tunisia, 1999. 17
[19] Gina Engström. Internationalisation and Localisation Problems in the Chinese and Arabic
Scripts. Master's thesis, Uppsala University, 2008. 18, 19
[22] Ulf Hermjakob, Kevin Knight, and Hal Daumé III. Name Translation in Statistical Machine
Translation - Learning When to Transliterate. In Proceedings of ACL-08: HLT, Columbus,
Ohio, 2008. 21, 36, 120
[23] Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 2.0, 2004. Linguistic
Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2004L02, ISBN
1-58563-324-0. 20, 32, 36, 41, 47, 67, 68, 69, 70, 71
[24] Fadi Biadsy, Jihad El-Sana, and Nizar Habash. Online Arabic handwriting recognition using
Hidden Markov Models. In The 10th International Workshop on Frontiers in Handwriting
Recognition (IWFHR10), La Baule, France, 2006. 23
[25] Volker Märgner and Haikal El Abed. Arabic Word and Text Recognition - Current Develop-
ments. In Khalid Choukri and Bente Maegaard, editors, Proceedings of the Second International
Conference on Arabic Language Resources and Tools, Cairo, Egypt, April 2009. The MEDAR
Consortium. 23
[26] Liana M. Lorigo and Venu Govindaraju. Offline Arabic Handwriting Recognition: A Sur-
vey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5):712–724, 2006.
DOI: 10.1109/TPAMI.2006.102 23
[27] Kareem Darwish and Douglas W. Oard. Term Selection for Searching Printed Arabic. In
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 261–268, New York, NY, USA, 2002. ACM.
DOI: 10.1145/564376.564423 23
[28] Walid Magdy and Kareem Darwish. Arabic OCR Error Correction Using Character Segment
Correction, Language Modeling, and Shallow Morphology. In Proceedings of 2006 Conference
on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 408–414, Sydney,
Australia, 2006. DOI: 10.3115/1610075.1610132 23
[29] Prem Natarajan, Shirin Saleem, Rohit Prasad, Ehry MacRostie, and Krishna Subramanian.
Arabic and Chinese Handwriting Recognition, volume 4768 of Lecture Notes in Computer Science,
pages 231–250. Springer, Berlin, Germany, 2008. 23
[30] Shirin Saleem, Huaigu Cao, Krishna Subramanian, Marin Kamali, Rohit Prasad, and Prem
Natarajan. Improvements in BBN's HMM-based Offline Handwriting Recognition System.
In Khalid Choukri and Bente Maegaard, editors, 10th International Conference on Document
Analysis and Recognition (ICDAR), Barcelona, Spain, July 2009. 23
[31] Zhidong Lu, Issam Bazzi, Andras Kornai, John Makhoul, Premkumar Natarajan, and Richard
Schwartz. A Robust, Language-Independent OCR System. In the 27th AIPR Workshop:
Advances in Computer Assisted Recognition, SPIE, 1999. DOI: 10.1117/12.339811 23
[32] Stephanie Strassel. Linguistic Resources for Arabic Handwriting Recognition. In Khalid
Choukri and Bente Maegaard, editors, Proceedings of the Second International Conference on
Arabic Language Resources and Tools, Cairo, Egypt, April 2009. The MEDAR Consortium.
23
[33] Dimitra Vergyri and Katrin Kirchhoff. Automatic Diacritization of Arabic for Acoustic Mod-
eling in Speech Recognition. In Ali Farghaly and Karine Megerdoomian, editors, COLING
2004 Workshop on Computational Approaches to Arabic Script-based Languages, pages 66–73,
Geneva, Switzerland, 2004. DOI: 10.3115/1621804 24, 37
[35] Nizar Habash and Owen Rambow. Arabic Diacritization through Full Morphological Tag-
ging. In Proceedings of the 8th Meeting of the North American Chapter of the Association for
Computational Linguistics/Human Language Technologies Conference (HLT-NAACL'07), 2007.
DOI: 10.3115/1614108.1614122 24, 86, 88, 104
[36] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, New
Jersey, USA, 2000. 27
[37] Eugene E. Loos, Susan Anderson, Dwight H. Day, Jr., Paul C. Jordan, and J. Douglas Wingate.
Glossary of Linguistic Terms, 2004. 27, 41
[38] Clive Holes. Modern Arabic: Structures, Functions, and Varieties. Georgetown Classics in
Arabic Language and Linguistics. Georgetown University Press, 2004. 29, 30, 33, 42, 51, 52,
59, 63
[39] Janet C. E. Watson. The Phonology and Morphology of Arabic. Oxford University Press, 2002.
29, 30, 33
[40] Nizar Habash. On Arabic and its Dialects. Multilingual Magazine, 17(81), 2006. 30
[41] Fadi Biadsy, Nizar Habash, and Julia Hirschberg. Improving the Arabic Pronunciation Dic-
tionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics, pages 397–405, Boulder, Colorado,
June 2009. Association for Computational Linguistics. DOI: 10.3115/1620754.1620812 30,
37
[42] Y. A. El-Imam. Phonetization of Arabic: Rules and Algorithms. In Computer Speech and
Language 18, pages 339–373, 2004. DOI: 10.1016/S0885-2308(03)00035-4 31, 37
[43] Hany Hassan and Jeffrey Sorensen. An Integrated Approach for Arabic-English Named
Entity Translation. In Proceedings of the ACL Workshop on Computational Approaches to Semitic
Languages, pages 87–93, Ann Arbor, Michigan, June 2005. Association for Computational
Linguistics. DOI: 10.3115/1621787.1621803 36
[44] Bing Zhao, Nguyen Bach, Ian Lane, and Stephan Vogel. A Log-Linear Block Transliteration
Model based on Bi-Stream HMMs. In Human Language Technologies 2007: The Conference
of the North American Chapter of the Association for Computational Linguistics; Proceedings
of the Main Conference, pages 364–371, Rochester, New York, April 2007. Association for
Computational Linguistics. 36
[45] A. Freeman, S. Condon, and C. Ackerman. Cross Linguistic Name Matching in English
and Arabic. In Proceedings of the Human Language Technology Conference of the NAACL, Main
Conference, pages 471–478, New York City, USA, June 2006. Association for Computational
Linguistics. DOI: 10.3115/1220835.1220895 36, 120
[46] Bassam Haddad and Mustafa Yaseen. Detection and Correction of Non-Words in Arabic:
A Hybrid Approach. International Journal of Computer Processing Of Languages (IJCPOL),
2007. DOI: 10.1142/S0219427907001706 36
[47] Mohamed Maamouri, Ann Bies, and Seth Kulick. Enhancing the Arabic Treebank: a Collab-
orative Effort toward New Annotation Guidelines. In European Language Resources Associ-
ation (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation
(LREC'08), Marrakech, Morocco, May 2008. 36, 105
[48] Chiraz Ben Othmane Zribi and Mohammed Ben Ahmed. Efficient Automatic Correction of
Misspelled Arabic Words Based on Contextual Information. In Proceedings of the Knowledge-
Based Intelligent Information and Engineering Systems Conference, Oxford, UK, 2003. 36
[49] Khaled Shaalan, Amin Allam, and Abdallah Gomah. Towards Automatic Spell Checking
for Arabic. In Conference on Language Engineering, ELSE, Cairo, Egypt, 2003. 36
[50] Nizar Habash. Four Techniques for Online Handling of Out-of-Vocabulary Words in
Arabic-English Statistical Machine Translation. In Proceedings of ACL-08: HLT, Short Pa-
pers, pages 57–60, Columbus, Ohio, June 2008. Association for Computational Linguistics.
DOI: 10.3115/1557690.1557706 36, 120, 121
[51] Nizar Habash and Owen Rambow. Arabic Tokenization, Part-of-Speech Tagging and Mor-
phological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of
the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan,
June 2005. Association for Computational Linguistics. DOI: 10.3115/1219840.1219911 36,
66, 80, 86, 88, 92
[52] Linguistic Data Consortium. Egyptian Colloquial Arabic Lexicon. LDC catalog number
LDC99L22, ISBN 1-58563-155-8, 1999. 36, 68
[53] David Graff, Tim Buckwalter, Hubert Jin, and Mohamed Maamouri. Lexicon Development
for Varieties of Spoken Colloquial Arabic. In LREC 2006: Fifth International Conference on
Language Resources and Evaluation, pages 999–1004, Genova, Italy, 2006. 36
[54] Mona Diab, Nizar Habash, Owen Rambow, Mohamed Altantawy, and Yassine Benajiba. CO-
LABA: Arabic Dialect Annotation and Processing. In Proceedings of the seventh International
Conference on Language Resources and Evaluation (LREC), Valletta, Malta, 2010. 36
[55] Fred Jelinek. Large Vocabulary Continuous Speech Recognition. Technical report, CLSP,
Johns Hopkins University, Baltimore, MD, 1997. Summer Research Workshop Technical
Reports. 37
[56] Katrin Kirchhoff, Jeff Bilmes, Sourin Das, Nicolae Duta, Melissa Egan, Gang Ji, Feng
He, John Henderson, Daben Liu, Mohamed Noamany, Pat Schone, Richard Schwartz,
and Dimitra Vergyri. Novel Approaches to Arabic Speech Recognition: Report from
the 2002 Johns-Hopkins Summer Workshop. In Proceedings of ICASSP 2003, 2003.
DOI: 10.1109/ICASSP.2003.1198788 37
[57] Katrin Kirchhoff, Dimitra Vergyri, Jeff A. Bilmes, Kevin Duh, and Andreas Stolcke.
Morphology-based Language Modeling for Conversational Arabic Speech Recognition.
Computer Speech and Language, 20:589–608, 2006. DOI: 10.1016/j.csl.2005.10.001 37
[58] D. Vergyri, A. Mandal, W. Wang, A. Stolcke, J. Zheng, M. Graciarena, D. Rybach, C. Gollan,
R. Schlüter, K. Kirchhoff, A. Faria, and N. Morgan. Development of the SRI/Nightingale
Arabic ASR System. In Proceedings of Interspeech 2008, 2008. 37
[59] R. Sproat, editor. Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer,
Boston, MA, 1997. 37
[60] M. Afify, R. Sarikaya, H. Kuo, L. Besacier, and Y. Gao. On the Use of Morphological Analysis
for Dialectal Arabic Speech Recognition. In Proceedings of Interspeech 2006, Pittsburgh PA.,
2006. 37
[61] F. Diehl, M.J.F. Gales, M. Tomalin, and P.C. Woodland. Morphological Analysis and De-
composition for Arabic Speech-to-Text Systems. In Proceedings of InterSpeech, 2009. 37,
69
[62] Roger Hsiao, Ashish Venugopal, Thilo Köhler, Ying Zhang, Paisarn Charoenpornsawat, An-
dreas Zollmann, Stephan Vogel, Alan W Black, Tanja Schultz, and Alex Waibel. Optimizing
Components for Handheld Two-way Speech Translation for an English-Iraqi Arabic System.
In INTERSPEECH, Pittsburgh, PA, 2006. 37
[63] Fawzi Alorfi. Automatic Identification Of Arabic Dialects Using Hidden Markov Models. PhD
thesis, University of Pittsburgh, 2008. 37
[64] Fadi Biadsy, Julia Hirschberg, and Nizar Habash. Spoken Arabic Dialect Identification
Using Phonotactic Modeling. In Proceedings of the EACL 2009 Workshop on Computational
Approaches to Semitic Languages, pages 53–61, Athens, Greece, March 2009. Association for
Computational Linguistics. DOI: 10.3115/1621774.1621784 37
[65] Fadi Biadsy and Julia Hirschberg. Using Prosody and Phonotactics in Arabic Dialect Iden-
tification. In Proceedings of Interspeech, Brighton, UK, 2009. 37
[66] Fadi Biadsy, Andrew Rosenberg, Rolf Carlson, Julia Hirschberg, and Eva Strangert. A
Cross-Cultural Comparison of American, Palestinian, and Swedish Perception of Charis-
matic Speech. In Speech Prosody, Campinas, Brazil, 2008. 37
[67] Otakar Smrž. Functional Arabic Morphology. Formal System and Implementation. PhD thesis,
Charles University in Prague, Prague, Czech Republic, 2007. 39, 67, 69, 72, 75, 84, 107
[68] Georges Bohas. Matrices, Étymons, Racines: Éléments d'une théorie lexicographique du vocabu-
laire arabe. Peeters, Leuven, 1997. 41
[69] Z. Harris. Linguistic structure of Hebrew. Journal of the American Oriental Society, 62:143–67,
1941. DOI: 10.2307/594501 43
[70] John J. McCarthy. A Prosodic Theory of Nonconcatenative Morphology. Linguistic Inquiry,
12:373–418, 1981. 43
[71] Mohamed Maamouri and Ann Bies. Developing an Arabic Treebank: Methods, Guidelines,
Procedures, and Tools. In Proceedings of the COLING 2004 Workshop on Computational Ap-
proaches to Arabic Script-based Languages, pages 2–9, 2004. DOI: 10.3115/1621804.1621808
47, 105
[72] Mark W. Cowell. A Reference Grammar of Syrian Arabic. Georgetown University Press, 1964.
50
[73] Wallace Erwin. A Short Reference Grammar of Iraqi Arabic. Georgetown University Press,
1963. 50
[74] Ernest T. Abdel-Massih, Zaki N. Abdel-Malek, and El-Said M. Badawi. A Reference Grammar
of Egyptian Arabic. Georgetown University Press, 1979. 50
[75] Richard Harrell. A Short Reference Grammar of Moroccan Arabic. Georgetown University
Press, 1962. 50
[76] Eckehard Schulz. A Student Grammar of Modern Standard Arabic. Cambridge University
Press, New York, 2005. 51, 63, 93
[77] Ron Buckley. Modern Literary Arabic: A Reference Grammar. Librairie du Liban, 2004. 52,
57, 61, 63, 93
[78] William Wright. A Grammar of the Arabic Language. Cambridge University Press, reprint of
third revised edition, 1991. Translated from the German of Caspari and edited with numerous
additions and corrections by W. Wright, revised by W. Robertson Smith and M. J. de Goeje,
preface and addenda et corrigenda by Pierre Cachia. 55, 63, 93
[79] Nizar Habash, Ryan Gabbard, Owen Rambow, Seth Kulick, and Mitch Marcus. Determin-
ing Case in Arabic: Learning Complex Linguistic Behavior Requires Complex Linguistic
Features. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 1084–1092,
2007. 57, 104, 111
[80] Nizar Habash and Owen Rambow. MAGEAD: A Morphological Analyzer and Gener-
ator for the Arabic Dialects. In Proceedings of the 21st International Conference on Compu-
tational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics,
pages 681–688, Sydney, Australia, July 2006. Association for Computational Linguistics.
DOI: 10.3115/1220175.1220261 59, 67, 69, 75
[81] Karin C Ryding. A Reference Grammar of Modern Standard Arabic. Reference Grammars.
Cambridge University Press, New York, 2006. 63, 93
[82] Mohamed Maamouri, Ann Bies, Sondos Krouna, Fatma Gaddeche, and Basma Bouziri. Penn
Arabic Treebank Guidelines. Linguistic Data Consortium, 2009. 63, 80, 93, 104, 105, 106
[83] Nizar Habash and Fatiha Sadat. Arabic Preprocessing Schemes for Statistical Machine
Translation. In Proceedings of the Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers, pages 49–52, New York City, USA, June 2006. Association
for Computational Linguistics. DOI: 10.3115/1614049.1614062 66, 76, 77, 88, 89, 121
[84] Muhammed Aljlayl and Ophir Frieder. On Arabic Search: Improving the Retrieval Ef-
fectiveness via a Light Stemming Approach. In Proceedings of ACM Eleventh Con-
ference on Information and Knowledge Management, McLean, VA, pages 340–347, 2002.
DOI: 10.1145/584792.584848 67
[85] Imad Al-Sughaiyer and Ibrahim Al-Kharashi. Arabic Morphological Analysis Techniques: A
Comprehensive Survey. Journal of the American Society for Information Science and Technology,
55(3):189–213, 2004. DOI: 10.1002/asi.10368 67
[86] Kenneth Beesley. Arabic Finite-State Morphological Analysis and Generation. In Proceedings
of the 16th International Conference on Computational Linguistics (COLING-96), pages 89–94,
Copenhagen, Denmark, 1996. DOI: 10.3115/992628.992647 67, 69, 73
[88] Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 1.0, 2002. Linguistic
Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2002L49. 67, 70, 71,
75, 110
[89] Kareem Darwish. Building a Shallow Morphological Analyzer in One Day. In Proceedings of
the workshop on Computational Approaches to Semitic Languages in the 40th Annual Meeting of
the Association for Computational Linguistics (ACL-02), pages 47–54, Philadelphia, PA, USA,
2002. DOI: 10.3115/1118637.1118643 67, 68
[90] Jim Yaghi and Sane Yagi. Systematic Verb Stem Generation for Arabic. In COLING 2004
Computational Approaches to Arabic Script-based Languages, pages 23–30, Geneva, Switzerland,
2004. DOI: 10.3115/1621804.1621812 67
[91] George Kiraz. Multi-tape Two-level Morphology: A Case study in Semitic Non-Linear
Morphology. In Proceedings of Fifteenth International Conference on Computational Linguistics
(COLING-94), pages 180–186, Kyoto, Japan, 1994. DOI: 10.3115/991886.991917 67
[92] Nizar Habash, Owen Rambow, and George Kiraz. Morphological Analysis and Generation
for Arabic Dialects. In Proceedings of the Workshop on Computational Approaches to Semitic
Languages at 43rd Meeting of the Association for Computational Linguistics (ACL'05), pages
17–24, Ann Arbor, Michigan, 2005. DOI: 10.3115/1621787.1621791 67, 75
[94] Violetta Cavalli-Sforza, Abdelhadi Soudi, and Teruko Mitamura. Arabic Morphology Gen-
eration Using a Concatenative Strategy. In Proceedings of the 6th Applied Natural Language
Processing Conference (ANLP 2000), pages 86–93, Seattle, Washington, USA, 2000. 67
[96] Nizar Habash. Large Scale Lexeme Based Arabic Morphological Generation. In Proceedings
of Traitement Automatique des Langues Naturelles (TALN-04), pages 271–276, 2004. Fez,
Morocco. 67, 69, 71
[97] Nizar Habash. Arabic Morphological Representations for Machine Translation. In A. van den
Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical
Methods. Springer, 2007. 67, 72, 88, 89
[98] Franz Josef Och. Google System Description for the 2005 NIST MT Evaluation. In MT
Eval Workshop (unpublished talk), 2005. 68, 78, 121
[99] Ibrahim Badr, Rabih Zbib, and James Glass. Segmentation for English-to-Arabic
Statistical Machine Translation. In Proceedings of ACL-08: HLT, Short Papers, pages
153–156, Columbus, Ohio, June 2008. Association for Computational Linguistics.
DOI: 10.3115/1557690.1557732 69, 77, 79, 89, 122, 124
[100] George Anton Kiraz. Multi-Tiered Nonlinear Morphology Using Multi-Tape Finite Au-
tomata: A Case Study on Syriac and Arabic. Computational Linguistics, 26(1):77–105, 2000.
DOI: 10.1162/089120100561647 73
[101] Mohamed Altantawy, Nizar Habash, Owen Rambow, and Ibrahim Saleh. Morphological
Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach. In Pro-
ceedings of the seventh International Conference on Language Resources and Evaluation (LREC),
Valletta, Malta, 2010. 75
[102] Viktor Bielický and Otakar Smrž. Enhancing the ElixirFM Lexicon with Verbal Valency
Frames. In Khalid Choukri and Bente Maegaard, editors, Proceedings of the Second Inter-
national Conference on Arabic Language Resources and Tools, Cairo, Egypt, April 2009. The
MEDAR Consortium. 75
[103] Markus Forsberg and Aarne Ranta. Functional Morphology. In Proceedings of the Ninth ACM
SIGPLAN International Conference on Functional Programming, ICFP 2004, pages 213–223.
ACM Press, 2004. DOI: 10.1145/1016850.1016879 75
[104] Leah S. Larkey, Lisa Ballesteros, and Margaret E. Connell. Arabic Computational Morphology:
Knowledge-based and Empirical Methods, chapter Light Stemming for Arabic Information
Retrieval. Springer Netherlands, kluwer/springer edition, 2007. 77, 90
[105] Fatiha Sadat and Nizar Habash. Combination of Arabic Preprocessing Schemes for Sta-
tistical Machine Translation. In Proceedings of the 21st International Conference on Compu-
tational Linguistics and 44th Annual Meeting of the Association for Computational Linguis-
tics, pages 1–8, Sydney, Australia, July 2006. Association for Computational Linguistics.
DOI: 10.3115/1220175.1220176 77, 79, 88, 89, 121
[106] Hassan Al-Haj and Alon Lavie. The Impact of Arabic Morphological Segmentation on
Broad-coverage English-to-Arabic Statistical Machine Translation. In Proceedings of the
Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado,
2010. 77, 122, 124
[107] Jakob Elming and Nizar Habash. Combination of Statistical Word Alignments Based on
Multiple Preprocessing Schemes. In Human Language Technologies 2007: The Conference of
the North American Chapter of the Association for Computational Linguistics; Companion Volume,
Short Papers, pages 25–28, Rochester, New York, April 2007. Association for Computational
Linguistics. DOI: 10.3115/1614108.1614115 77, 89, 121
[108] Young-Suk Lee. Morphological Analysis for Statistical Machine Translation. In Proceedings
of the 5th Meeting of the North American Chapter of the Association for Computational Linguis-
tics/Human Language Technologies Conference (HLT-NAACL'04), pages 57–60, Boston, MA,
2004. DOI: 10.3115/1613984.1613999 79, 121
[109] Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. Automatic Tagging of Arabic Text: From
Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American
Chapter of the Association for Computational Linguistics/Human Language Technologies Confer-
ence (HLT-NAACL'04), pages 149–152, Boston, MA, 2004. DOI: 10.3115/1613984.1614022
79, 80, 89
[110] Yuval Marton, Nizar Habash, and Owen Rambow. Improving Arabic Dependency Parsing
with Lexical and Inflectional Morphological Features. In Proceedings of the NAACL HLT
2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 13–21,
Los Angeles, CA, USA, June 2010. Association for Computational Linguistics. 79, 83, 111,
112
[111] Mona Diab. Towards an Optimal POS tag set for Modern Standard Arabic Processing. In
Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria,
2007. 79, 83, 90
[112] Nizar Habash and Ryan Roth. CATiB: The Columbia Arabic Treebank. In Proceedings of
the ACL-IJCNLP 2009 Conference Short Papers, pages 221–224, Suntec, Singapore, August
2009. Association for Computational Linguistics. DOI: 10.3115/1667583.1667651 79, 83,
104, 108, 112
[113] Shereen Khoja. APT: Arabic Part-of-Speech Tagger. In Proceedings of Student Research
Workshop at NAACL 2001, pages 20–26, Pittsburgh, 2001. Association for Computational
Linguistics. 80, 84
[114] Jan Hajič, Otakar Smrž, Tim Buckwalter, and Hubert Jin. Feature-based Tagger of Ap-
proximations of Functional Arabic Morphology. In Ma. Antonia Martí, Montserrat Civit, and
Sandra Kübler, editors, Proceedings of Treebanks and Linguistic Theories (TLT), pages 53–64,
Barcelona, Spain, 2005. 80, 84, 107
[115] Noah Smith, David Smith, and Roy Tromble. Context-Based Morphological Disam-
biguation with Random Fields. In Proceedings of the 2005 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP'05), pages 475–482, Vancouver, Canada, 2005.
DOI: 10.3115/1220575.1220635 80
[116] Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. Arabic Mor-
phological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature
Ranking. In Proceedings of ACL-08: HLT, Short Papers, pages 117–120, Columbus, Ohio, June
2008. Association for Computational Linguistics. DOI: 10.3115/1557690.1557721 80, 86,
88, 104
[117] Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. Arabic Computational Morphology:
Knowledge-based and Empirical Methods, chapter Automated Methods for Processing Arabic
Text: From Tokenization to Base Phrase Chunking. Springer Netherlands, kluwer/springer
edition, 2007. 80, 89, 90, 112
[118] Mitchell M. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large
Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330,
June 1993. 80, 105
[119] Seth Kulick, Ryan Gabbard, and Mitch Marcus. Parsing the Arabic Treebank: Analysis and
Improvements. In Proceedings of the Treebanks and Linguistic Theories Conference, pages 31–42,
Prague, Czech Republic, 2006. 80, 82, 104, 111, 112
[120] Mona Diab. Improved Arabic Base Phrase Chunking with a New Enriched POS Tag Set.
In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Com-
mon Issues and Resources, pages 89–96, Prague, Czech Republic, June 2007. Association for
Computational Linguistics. DOI: 10.3115/1654576.1654592 82, 90, 112
[121] Nizar Habash, Reem Faraj, and Ryan Roth. Syntactic Annotation in the Columbia Arabic
Treebank. In Proceedings of MEDAR International Conference on Arabic Language Resources
and Tools, Cairo, Egypt, 2009. 83, 104, 108
[122] Shereen Khoja, Roger Garside, and Gerry Knowles. A tagset for the morphosyntactic tagging
of Arabic. In Proceedings of Corpus Linguistics 2001, pages 341–353, Lancaster, UK, 2001. 84
[123] Otakar Smrž and Petr Zemánek. Sherds from an Arabic Treebanking Mosaic. Prague Bulletin
of Mathematical Linguistics, 78:63–76, 2002. 84
[124] Marta R. Costa-jussà, Josep M. Crego, Adrià de Gispert, Patrik Lambert, Maxim Khalilov,
José A.R. Fonollosa, José B. Mariño, and Rafael Banchs. TALP Phrase-Based System and
TALP System Combination for IWSLT 2006. In Proc. of the International Workshop on Spoken
Language Translation, pages 123–129, Kyoto, Japan, 2006. 89
[125] Josep M. Crego, Adrià de Gispert, Patrik Lambert, Maxim Khalilov, Marta R. Costa-jussà,
José B. Mariño, Rafael Banchs, and José A.R. Fonollosa. The TALP Ngram-based SMT
System for IWSLT 2006. In Proc. of the International Workshop on Spoken Language Translation,
pages 116–122, Kyoto, Japan, 2006. 89
[126] David Vilar, Daniel Stein, Yuqi Zhang, Evgeny Matusov, Arne Mauser, Oliver Bender, Saab
Mansour, and Hermann Ney. The RWTH Machine Translation System for IWSLT 2008.
In Proc. of the International Workshop on Spoken Language Translation, pages 108–115, Hawaii,
USA, 2008. 89
[127] Mona Diab, Mahmoud Ghoneim, and Nizar Habash. Arabic Diacritization in the Context of
Statistical Machine Translation. In Proceedings of Machine Translation Summit (MT-Summit),
Copenhagen, Denmark, 2007. 89
[128] Benjamin Farber, Dayne Freitag, Nizar Habash, and Owen Rambow. Improving NER in
Arabic Using a Morphological Tagger. In Proceedings of the Language Resources and Evaluation
Conference (LREC), Marrakech, Morocco, 2008. 89, 117
[129] Yassine Benajiba, Mona Diab, and Paolo Rosso. Arabic Named Entity Recognition using
Optimized Feature Sets. In Proceedings of the 2008 Conference on Empirical Methods in Nat-
ural Language Processing, pages 284–293, Honolulu, Hawaii, October 2008. Association for
Computational Linguistics. DOI: 10.3115/1613715.1613755 89, 90, 112, 117
[130] Josep M. Crego and Nizar Habash. Using Shallow Syntax Information to Improve Word
Alignment and Reordering for SMT. In Proceedings of the Third Workshop on Statistical Ma-
chine Translation, pages 53–61, Columbus, Ohio, June 2008. Association for Computational
Linguistics. DOI: 10.3115/1626394.1626401 90, 112, 123
[132] Nicolas Stroppa and Andy Way. MATREX: DCU Machine Translation System for IWSLT
2006. In Proc. of the International Workshop on Spoken Language Translation, pages 31–36,
Kyoto, Japan, 2006. 90
[133] David Farwell, Jesús Giménez, Edgar González, Reda Halkoum, Horacio Rodríguez, and
Mihai Surdeanu. The UPC System for Arabic-to-English Entity Translation. In Proceedings
of ACE 2007, 2007. 90
[134] Robert D. Van Valin. An Introduction to Syntax. Cambridge University Press, 2001. 93
[135] Beatrice Santorini and Anthony Kroch. The Syntax of Natural Language: An Online Intro-
duction using the Trees Program, 2007. 93
[136] Mohamed Maamouri, Ann Bies, and Seth Kulick. Creating a Methodology for Large-Scale
Correction of Treebank Annotation: The Case of the Arabic Treebank. In Proceedings of
MEDAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2009.
93, 104, 105
[137] Otakar Smrž and Jan Hajič. The Other Arabic Treebank: Prague Dependencies and Func-
tions. In Ali Farghaly, editor, Arabic Computational Linguistics: Current Implementations. CSLI
Publications, 2006. 104
[138] Otakar Smrž, Viktor Bielický, Iveta Kouřilová, Jakub Kráčmar, Jan Hajič, and Petr Zemánek.
Prague Arabic Dependency Treebank: A Word on the Million Words. In Proceedings of the
Workshop on Arabic and Local Languages (LREC 2008), pages 16–23, Marrakech, Morocco,
2008. 104, 107, 117
[139] Mohamed Maamouri and Christopher Cieri. Resources for Natural Language Processing at
the Linguistic Data Consortium. In Proceedings of the International Symposium on Processing
of Arabic, pages 125–146, Manouba, Tunisia, 2002. 105
[140] Mohamed Maamouri, Ann Bies, Tim Buckwalter, Mona Diab, Nizar Habash, Owen Ram-
bow, and Dalila Tabessi. Developing and Using a Pilot Dialectal Arabic Treebank. In Pro-
ceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06,
Genoa, Italy, 2006. 105, 111
[141] Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in Its Semantic
and Pragmatic Aspects. D. Reidel & Academia, 1986. 106
[142] Jan Hajič, Barbora Hladká, and Petr Pajas. The Prague Dependency Treebank: Annotation
Structure and Support. In Proceedings of the IRCS Workshop on Linguistic Databases, pages
105–114, Philadelphia, 2001. University of Pennsylvania. 106
[144] Zdeněk Žabokrtský and Otakar Smrž. Arabic Syntactic Trees: from Constituency to De-
pendency. In Proceedings of the Eleventh Conference of the European Chapter of the Associ-
ation for Computational Linguistics (EACL'03) Research Notes, Budapest, Hungary, 2003.
DOI: 10.3115/1067737.1067779 107
[145] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. Prague Arabic De-
pendency Treebank: Development in Data and Tools. In NEMLAR International Conference
on Arabic Language Resources and Tools, pages 110–117. ELDA, 2004. 107
[146] Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gulsen Eryigit, Sandra Kubler,
Svetoslav Marinov, and Erwin Marsi. MaltParser: A Language-independent System for
Data-driven Dependency Parsing. Natural Language Engineering, 13(2):95–135, 2007.
DOI: 10.1017/S1351324906004505 107, 112
[147] Fei Xia, Owen Rambow, Rajesh Bhatt, Martha Palmer, and Dipti Misra Sharma. Towards
a Multi-Representational Treebank. In Proceedings of Treebanks and Linguistic Theories (TLT
7), Groningen, Netherlands, 2009. 109
[148] David Chiang, Mona Diab, Nizar Habash, Owen Rambow, and Safiullah Shareef. Parsing
Arabic Dialects. In Proceedings of the European Chapter of ACL (EACL), 2006. 111, 112
[149] Kais Dukes and Tim Buckwalter. A Dependency Treebank of the Quran using Traditional
Arabic Grammar. In Proceedings of the 7th international conference on Informatics and Systems
(INFOS 2010), Cairo, Egypt, 2010. 112
[150] Lamia Tounsi, Mohammed Attia, and Josef van Genabith. Automatic Treebank-Based Ac-
quisition of Arabic LFG Dependency Structures. In Proceedings of the EACL 2009 Work-
shop on Computational Approaches to Semitic Languages, pages 45–52, Athens, Greece, 2009.
DOI: 10.3115/1621774.1621783 112
[151] Martha Palmer, Olga Babko-Malaya, Ann Bies, Mona Diab, Mohamed Maamouri, Aous
Mansouri, and Wajdi Zaghouani. A Pilot Arabic Propbank. In Proceedings of LREC, Mar-
rakech, Morocco, May 2008. 112, 114
[152] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel.
OntoNotes: The 90% Solution. In NAACL '06: Proceedings of the Human Language Technology
Conference of the NAACL, Companion Volume: Short Papers on XX, pages 57–60, Morristown,
NJ, USA, 2006. Association for Computational Linguistics. DOI: 10.3115/1614049.1614064
112, 117
[154] Ryan Gabbard and Seth Kulick. Construct State Modification in the Arabic Treebank.
In Proceedings of ACL-08: HLT, Short Papers, pages 209–212, Columbus, Ohio, June 2008.
Association for Computational Linguistics. DOI: 10.3115/1557690.1557750 112
[155] Dan Klein and Christopher D. Manning. Accurate Unlexicalized Parsing. In Proceed-
ings of the 41st Meeting of the Association for Computational Linguistics (ACL'03), 2003.
DOI: 10.3115/1075096.1075150 112
[156] Spence Green, Conal Sathi, and Christopher D. Manning. NP Subject Detection in Verb-
initial Arabic Clauses. In Proceedings of the Third Workshop on Computational Approaches to
Arabic Script-based Languages (CAASL3), 2009. 112, 123
[157] Michael C. McCord and Violetta Cavalli-Sforza. An Arabic Slot Grammar Parser.
In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages,
pages 81-88, Morristown, NJ, USA, 2007. Association for Computational Linguistics.
DOI: 10.3115/1654576.1654591 112
[158] Eman Othman, Khaled Shaalan, and Ahmed Rafea. A Chart Parser for Analyzing Mod-
ern Standard Arabic Sentence. In Proceedings of the MT Summit IX Workshop on Machine
Translation for Semitic Languages: Issues and Approaches, pages 37-44, 2003. 112
[160] R. Jackendoff. Semantic Structures. MIT Press, Boston, Mass, 1990. 113
[163] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet Project.
In COLING-ACL '98: Proceedings of the Conference, held at the University of Montréal, pages
86-90, 1998. DOI: 10.3115/980845.980860 114
[164] Nianwen Xue and Martha Palmer. Adding Semantic Roles to the Chinese Treebank. Nat.
Lang. Eng., 15(1):143-172, 2009. DOI: 10.1017/S1351324908004865 114
[165] Mona Diab, Alessandro Moschitti, and Daniele Pighin. Semantic Role Labeling Systems for
Arabic using Kernel Methods. In Proceedings of ACL-08: HLT, pages 798-806, Columbus,
Ohio, June 2008. Association for Computational Linguistics. 115
[166] Mona Diab, Musa Alkhalifa, Sabry ElKateb, Christiane Fellbaum, Aous Mansouri, and
Martha Palmer. SemEval-2007 Task 18: Arabic Semantic Labeling. In Proceedings
of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages
93-98, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
DOI: 10.3115/1621474.1621491 115, 116
[167] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
http://www.cogsci.princeton.edu/wn [2000, September 7]. 116
[168] Piek Vossen. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer
Academic Publishers, Dordrecht, 1998. 116
[182] Jason Riesa and David Yarowsky. Minimally Supervised Morphological Segmentation with
Applications to Machine Translation. In Proceedings of the 7th Conference of the Association for
Machine Translation in the Americas (AMTA'06), pages 185-192, Cambridge, MA, 2006. 121
[183] Anas El Isbihani, Shahram Khadivi, Oliver Bender, and Hermann Ney. Morpho-syntactic
Arabic Preprocessing for Arabic to English Statistical Machine Translation. In Proceedings
on the Workshop on Statistical Machine Translation, pages 15-22, New York City, June 2006.
Association for Computational Linguistics. DOI: 10.3115/1654650.1654654 121
[184] Nizar Habash. Syntactic Preprocessing for Statistical MT. In Proceedings of the Machine
Translation Summit (MT SUMMIT XI), Copenhagen, Denmark, 2007. 123
[185] Nizar Habash, Bonnie Dorr, and Christof Monz. Challenges in Building an Arabic-English
GHMT System with SMT Components. In Proceedings of the 7th Conference of the Association
for Machine Translation in the Americas (AMTA'06), pages 56-65, Cambridge, MA, 2006. 123
[186] Steve DeNeefe and Kevin Knight. Synchronous Tree Adjoining Machine Translation.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,
pages 727-736, Singapore, August 2009. Association for Computational Linguistics.
DOI: 10.3115/1699571.1699607 123
[187] Marine Carpuat, Yuval Marton, and Nizar Habash. Improving Arabic-to-English Statistical
Machine Translation by Reordering Post-Verbal Subjects for Alignment. In Proceedings of the
ACL 2010 Conference Short Papers, pages 178-183, Uppsala, Sweden, July 2010. Association
for Computational Linguistics. 123
[188] Ibrahim Badr, Rabih Zbib, and James Glass. Syntactic Phrase Reordering for English-to-
Arabic Statistical Machine Translation. In Proceedings of the 12th Conference of the European
Chapter of the ACL (EACL 2009), pages 86-93, Athens, Greece, March 2009. Association for
Computational Linguistics. DOI: 10.3115/1609067.1609076 123, 124
[189] Jakob Elming and Nizar Habash. Syntactic Reordering for English-Arabic Phrase-Based
Machine Translation. In Proceedings of the EACL 2009 Workshop on Computational Approaches
to Semitic Languages, pages 69-77, Athens, Greece, March 2009. Association for
Computational Linguistics. DOI: 10.3115/1621774.1621786 123
[190] Saša Hasan, Anas El Isbihani, and Hermann Ney. Creating a Large-Scale Arabic to French
Statistical Machine Translation System. In Proceedings of Language Resources and Evaluation
Conference (LREC), pages 855-858, Genoa, Italy, 2006. 124
[191] Nizar Habash and Jun Hu. Improving Arabic-Chinese Statistical Machine Translation using
English as Pivot Language. In Proceedings of the Fourth Workshop on Statistical Machine
Translation, pages 173-181, Athens, Greece, March 2009. Association for Computational
Linguistics. DOI: 10.3115/1626431.1626467 124
[192] Mossab Al-Hunaity, Bente Maegaard, and Dorte Hansen. Using English as a Pivot Lan-
guage to Enhance Danish-Arabic Statistical Machine Translation. In Workshop on Language
Resources and Human Language Technology for Semitic Languages in the Language Resources and
Evaluation Conference (LREC), Valletta, Malta, 2010. 124
[193] Reshef Shilon, Nizar Habash, Alon Lavie, and Shuly Wintner. Machine Translation between
Hebrew and Arabic: Needs, Challenges and Preliminary Solutions. In Proceedings of the
Student Research Workshop in the Conference of the Association for Machine Translation in the
Americas (AMTA), Denver, Colorado, 2010. 124
[194] Haytham Alsharaf, Sylviane Cardey, Peter Greenfield, and Yihui Shen. Problems and So-
lutions in Machine Translation Involving Arabic, Chinese and French. In Proceedings of the
International Conference on Information Technology, pages 293-297, Las Vegas, Nevada, 2004.
DOI: 10.1109/ITCC.2004.1286649 124
[195] Mohammed Sharaf. Implications of the Agreement Features in (English to Arabic) Machine
Translation. Master's thesis, Al-Azhar University, 2002. 124
[196] Abdelhadi Soudi, Violetta Cavalli-Sforza, and Abderrahim Jamari. A Prototype English-to-
Arabic Interlingua-based MT system. In Proceedings of the Third International Conference on
Language Resources and Evaluation: Workshop on Arabic language resources and evaluation, Las
Palmas, Spain, 2002. 124
[197] Abdelhadi Soudi. Challenges in the Generation of Arabic from Interlingua. In Proceedings
of Traitement Automatique des Langues Naturelles (TALN-04), pages 343-350, 2004. Fez,
Morocco. 124
[198] Azza Abdel-Monem, Khaled Shaalan, Ahmed Rafea, and Hoda Baraka. A Proposed Ap-
proach for Generating Arabic from Interlingua in a Multilingual Machine Translation System.
In Proceedings of the 4th Conference on Language Engineering, pages 197-206, 2003. Cairo,
Egypt. 124
[199] F. Gey and D. Oard. The TREC-2001 Cross-Language Information Retrieval Track: Search-
ing Arabic Using English, French or Arabic Queries. In The 10th Text Retrieval Conference
(TREC-10), 2001. 124
[200] F. Gey and D. Oard. The TREC-2002 Arabic/English Cross-Language Information Re-
trieval Track. In The 11th Text Retrieval Conference (TREC-11), 2002. 124
[201] Jinxi Xu, Alexander Fraser, and Ralph Weischedel. Empirical Studies in Strategies for Arabic
Retrieval. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 269-274, New York, NY, USA, 2002.
ACM. DOI: 10.1145/564376.564424 124
[202] Leah S. Larkey, Lisa Ballesteros, and Margaret E. Connell. Improving Stemming for Arabic
Information Retrieval: Light Stemming and Co-occurrence Analysis. In Proceedings of the
25th Annual International Conference on Research and Development in Information Retrieval
(SIGIR 2002), Tampere, Finland, pages 275-282, 2002. DOI: 10.1145/564376.564425 124
[203] K. Darwish and D. Oard. CLIR Experiments at Maryland for TREC 2002: Evidence
Combination for Arabic-English Retrieval. In The 11th Text Retrieval Conference (TREC-
11), 2002. 124
[204] Ramzi Abbès, Joseph Dichy, and Mohamed Hassoun. The Architecture of a Standard
Arabic Lexical Database. Some Figures, Ratios and Categories from the DIINAR.1 Source
Program. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational
Approaches to Arabic Script-based Languages, pages 15-22, Geneva, Switzerland, August 28th
2004. COLING. 138
Author's Biography
NIZAR HABASH
Nizar Habash is a research scientist at the Center for Computational Learning Systems at Columbia
University, where he has worked since 2004. He received a B.Sc. in Computer Engineering and a B.A.
in Linguistics and Languages from Old Dominion University in 1997. He received his Ph.D. in 2003
from the Computer Science Department, University of Maryland College Park. His Ph.D. thesis is
titled Generation-Heavy Hybrid Machine Translation. In 2005, he co-founded the Columbia Arabic
Dialect Modeling (CADIM) group with Mona Diab and Owen Rambow. Nizar's research includes
work on machine translation, natural language generation, lexical semantics, morphological analysis,
generation and disambiguation, syntactic parsing and annotation, and computational modeling of
Arabic and its dialects.
Nizar currently serves as secretary of the board of AMTA (Association for Machine Transla-
tion in the Americas) and of IAMT (International Association for Machine Translation). He served
as vice-president of the Semitic Language Special Interest Group in the Association for
Computational Linguistics (ACL) (2006-2009). He also served as the research community representative on
the AMTA board (2006-2008). He previously served as a research program co-chair for the AMTA
2006 conference, the Workshop on Computational Approaches to Semitic Languages (ACL 2005)
and the Workshop on Machine Translation for Semitic Languages (MT Summit 2003).
Nizar has published over 80 papers in international conferences and journals and has given
numerous lectures and tutorials for academic and industrial audiences.
Nizar's website is located at http://www.nizarhabash.com/.