1-Lecture One - (Chapter One-Introduction (NLP) )

Chapter 1 : NLP – Introduction

Adama Science and Technology University

School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2022)

 Introduction to NLP
 What is NLP?
 Aspects of Language Processing
 Goal of NLP
 History of NLP
 Application of NLP
 Open Problems
 Knowledge Sources
 Computational Morphology

What is Natural Language Processing?

 Natural Language Processing is one of the subfields of Artificial Intelligence (AI).

Disciplines of AI:
 Natural Language Processing (NLP)
 Knowledge Representation
 Automated Reasoning
 Machine Learning
 Computer Vision
 Robotics

What is Natural Language Processing?

 Develop computers that can understand and speak human (“natural”) language.

 Computers would be much easier to use with a natural language interface.

 Can we “teach” them to understand human language?

 Many approaches to NLP are based on Machine Learning.

What is Natural Language Processing?
 NLP is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to fruitfully
process large natural language corpora*.

 Sub-field of Artificial Intelligence, but very interdisciplinary.

 Computer science, human-computer interaction (HCI), linguistics,
cognitive psychology, speech signal processing, …

 *A corpus is a collection of written or spoken texts.

 With the use of computers, it is possible to compile large
amounts of authentic written and spoken language.
Aspects of Language Processing

 “Natural” languages:
 Geez, Amharic, Oromifa, Tigrigna, English, Mandarin, French,
Swahili, Arabic, …

 NOT Java, C++, Perl, … (Programming Languages)

Aspects of Language Processing

 Traditionally, work in NLP has tended to view the process of
language analysis as being decomposable into a number of stages,
mirroring the theoretical linguistic distinctions drawn between
syntax, semantics, and pragmatics.
 The simple view is that the sentences of a text are first analyzed in terms
of their syntax.
 This provides an order and structure that is more amenable to an
analysis in terms of semantics, or literal meaning.
 This is followed by a stage of pragmatic analysis, whereby the meaning
of the utterance or text in context is determined.
 This last stage is often seen as being concerned with discourse,
whereas the previous two are generally concerned with sentence-level matters.

Aspects of Language Processing

Figure 1. The Stages of Analysis in Processing Natural Language

 The tripartite distinction into syntax, semantics, and pragmatics
serves at best as a starting point when we consider the
processing of real NL texts.

 A finer-grained decomposition of the process is useful when we
take into account the current state of the art, together with the
need to deal with real language data, as reflected in Figure 1.

Aspects of Language Processing

 Word, lexicon: lexical analysis:

 Morphology, word segmentation;
 In lexical analysis, tokenization is the process of breaking a stream of text
up into words, phrases, symbols, or other meaningful elements called
tokens. The list of tokens becomes input for further processing such as
parsing or text mining.
 Tokenization is useful both in linguistics (where it is a form of text
segmentation) and in computer science, where it forms part of lexical analysis.
 Tokenization is the task of chopping a character sequence up into pieces, called tokens,
perhaps at the same time throwing away certain characters, such as punctuation.
 Example:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
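The tokenization step described above can be sketched with a regular expression. This is a minimal illustration, not a full tokenizer: the function name `tokenize` and the rule "a token is a run of word characters, optionally with internal apostrophes" are assumptions made for this example.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens, discarding punctuation and whitespace."""
    # \w+ matches a run of letters, digits, or underscores; the optional group
    # keeps word-internal apostrophes (e.g. "isn't") inside a single token.
    return re.findall(r"\w+(?:'\w+)*", text)

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```

The resulting list of tokens is what later stages such as parsing or text mining would take as input.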
Aspects of Language Processing

 Syntax:
 Sentence structure, phrase, grammar, …

 Semantics:
 Meaning,
 Execute commands

 Discourse analysis:
 Meaning of a text,
 Relationship between sentences (e.g. anaphora)

Aspects of Language Processing

 Morphology: What is a word?

 奧林匹克運動會(希臘語: Ολυμπιακοί Αγώνες, 簡稱奧運會或奧運)是國際奧林匹克委員會主辦的包含多種體育運動項目的國際性運動會,每四年舉行一次。
 لبيوتها = “to her houses”

 Lexicography: What does each word mean?

 He plays bass guitar.
 That bass was delicious!

 Syntax: How do the words relate to each other?

 The dog bit the man. ≠ The man bit the dog.
 But in Russian: человек собаку съел = человек съел собаку
 man ate a dog = man ate a dog

Aspects of Language Processing

 Semantics: How can we infer meaning from sentences?

 I saw the man on the hill with the telescope.
 The ipod is so small! ☺
 The monitor is so small!

 Discourse: How about across many sentences?

 President Obama met with President-Elect Trump today at the
White House. He welcomed him, and showed him around.
 Who is “he”? Who is “him”? How would a computer figure that out?

Aspects of Language Processing

 Syntax
 Lemmatization:
Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary form
of a word, which is known as the lemma.
 Morphological segmentation:
 Separate words into individual morphemes and identify the class of each morpheme.
 The difficulty of this task depends greatly on the complexity of the
morphology (i.e. the structure of words) of the language being considered.
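A toy sketch of lemmatization as described above: a vocabulary lookup combined with a few inflectional suffix rules. The word lists and rules below are invented for illustration only; a real lemmatizer would rely on a full lexicon and a proper morphological analyzer.

```python
# Hypothetical toy resources, chosen only to make the example run.
VOCABULARY = {"dog", "sneeze", "walk", "goose", "sing", "study"}
IRREGULAR = {"geese": "goose", "sang": "sing"}
# (suffix, replacement) pairs covering inflectional endings only.
SUFFIX_RULES = [("ies", "y"), ("ed", "e"), ("ed", ""), ("ing", ""), ("s", "")]

def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept a rule only when it yields a known vocabulary entry.
            if candidate in VOCABULARY:
                return candidate
    return word

print([lemmatize(w) for w in ["dogs", "sneezed", "studies", "geese"]])
# ['dog', 'sneeze', 'study', 'goose']
```

Checking every candidate against the vocabulary is what separates this from blind suffix chopping: a word such as "bass" is left unchanged because "bas" is not a known lemma.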

Aspects of Language Processing

 Syntax …
 Part-of-speech tagging:
 For example, "book" can be a noun ("the book on the table") or a verb ("to
book a flight"), so the tag has to be chosen from context.

 Sentence breaking (also known as sentence boundary disambiguation):

 Given a chunk of text, find the sentence boundaries.
 Sentence boundaries are often marked by periods or other punctuation
marks, but these same characters can serve other purposes (e.g.
marking abbreviations).
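A minimal sketch of sentence boundary disambiguation, assuming whitespace-separated tokens and a small hand-made abbreviation list (both are simplifications introduced for this example). A token ending in ".", "!" or "?" closes a sentence unless it is a known abbreviation.

```python
# Hypothetical, deliberately short abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text: str) -> list[str]:
    """Split text into sentences at punctuation that is not an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token.endswith((".", "!", "?"))
        if ends_with_mark and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing punctuation mark
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Abebe booked a flight. He left at noon."))
# ['Dr. Abebe booked a flight.', 'He left at noon.']
```

The period after "Dr" is not treated as a boundary, which is exactly the ambiguity described above. An abbreviation that really does end a sentence would still be missed by this rule.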

Aspects of Language Processing

 Syntax …
 Stemming
 Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of reducing each word to its base form correctly most of the time,
and often includes the removal of derivational affixes.
 Word segmentation
 Separate a chunk of continuous text into separate words.
 For a language like English, this is fairly trivial, since words are usually separated
by spaces.
 However, some written languages like Chinese, Japanese and Thai do not mark
word boundaries in such a fashion, and in those languages text segmentation is a
significant task requiring knowledge of the vocabulary and morphology of words
in the language.
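Both ideas above can be sketched in a few lines. `crude_stem` chops endings without consulting any vocabulary, and `max_match` segments unspaced text by greedy longest match against a word list. The suffix list, the minimum stem length of three characters, and the tiny vocabulary are assumptions made for this example, not a standard algorithm's parameters.

```python
def crude_stem(word: str) -> str:
    """Chop a common ending off a word without checking a vocabulary."""
    for suffix in ("ment", "ing", "ed", "er", "s"):
        # Keep at least three characters so that short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def max_match(text: str, vocabulary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; vocabulary must be non-empty."""
    words, i = [], 0
    longest = max(map(len, vocabulary))
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i : i + size]
            # Fall back to a single character when nothing matches.
            if piece in vocabulary or size == 1:
                words.append(piece)
                i += size
                break
    return words

print(crude_stem("teacher"), crude_stem("sneezed"))  # teach sneez
print(max_match("奧林匹克運動會", {"奧林匹克", "運動會", "運動"}))
# ['奧林匹克', '運動會']
```

Note that "sneez" is not a dictionary word: a stemmer only needs related forms to collapse to the same string. Greedy matching can also choose the wrong split when a longer vocabulary entry overlaps the correct boundary, which is one reason segmentation is a significant task in these languages.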

Aspects of Language Processing

 Semantics (Individual Assignment One: define the terms)

 Lexical semantics
 Machine translation
 Named entity recognition
 Natural language generation
 Natural language understanding
 Optical character recognition
 Question answering
 Recognizing Textual entailment
 Relationship extraction
 Sentiment analysis
 Topic segmentation
 Word sense disambiguation

Aspects of Language Processing

 Discourse :
 Automatic summarization
 Coreference resolution
 Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same
objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically
concerned with matching up pronouns with the nouns or names to which they refer.
 The more general task of coreference resolution also includes identifying so-called "bridging
relationships" involving referring expressions.
 For example, in a sentence such as "He entered John's house through the front door", "the front
door" is a referring expression and the bridging relationship to be identified is the fact that the
door being referred to is the front door of John's house (rather than of some other structure that
might also be referred to).
 Discourse analysis:
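To make the anaphora resolution task concrete, here is a deliberately naive baseline: link every pronoun to the most recently mentioned entity name. The pronoun list, the function name, and the assumption that entity names are given in advance are all simplifications for illustration; real coreference systems use far richer evidence.

```python
PRONOUNS = {"he", "him", "she", "her", "it", "they", "them"}

def resolve_pronouns(tokens, entity_names):
    """Return (position, pronoun, antecedent) triples using recency only."""
    links, last_entity = [], None
    for index, token in enumerate(tokens):
        word = token.strip(".,;!?")
        if word in entity_names:
            last_entity = word
        elif word.lower() in PRONOUNS and last_entity is not None:
            links.append((index, word, last_entity))
    return links

tokens = "Obama met with Trump today. He welcomed him.".split()
print(resolve_pronouns(tokens, {"Obama", "Trump"}))
# [(5, 'He', 'Trump'), (7, 'him', 'Trump')]
```

The baseline links both "He" and "him" to the same person, which cannot be right for "He welcomed him". Getting this example right requires syntactic and world knowledge, not just recency.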

Aspects of Language Processing

 Speech Processing
 Speech recognition
 Speech segmentation
 Given a sound clip of a person or people speaking, separate
it into words.
 A subtask of speech recognition and typically grouped with it.

 Text-to-speech (Speech Synthesis)

Goal of Natural Language Processing

 Ultimate goal: Natural human-to-computer communication.

 The goal of natural language processing (NLP) is to design and build
computer systems that are able to analyze natural languages like Geez,
Amharic, German or English, and that generate their outputs in a natural
language, too.
 In natural language understanding, the objective is to extract the meaning
of an input sentence or an input text. Usually, the meaning is represented in
a suitable formal representation language so that it can be processed by a computer.
 The goal in text classification is to assign a text document to one out of
several text classes.
 Example: for newspaper articles, such classes are sports reports, finances, and …
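Text classification as described above can be sketched with a small Naive Bayes classifier written from scratch. The training sentences and the two class names are invented for illustration, and the choice of Naive Bayes with add-one smoothing is an assumption of this example, not something the slides prescribe.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per class; labeled_docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest add-one-smoothed log score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set, for illustration only.
TRAINING = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
]
model = train(TRAINING)
print(classify("the team scored a goal", *model))  # sports
```

With only four training sentences the result is driven by a handful of words; the point is the pipeline (count, smooth, compare scores), not the accuracy.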

History of Natural Language Processing
 1950s
 Early MT: word translation + re-ordering.
 Chomsky’s generative grammar.
 Bar-Hillel’s argument.
 1960s-80s
 Applications:
 BASEBALL: NL interface for searching a database on baseball.
 LUNAR: NL interface for searching a database on lunar rock samples.
 ELIZA: simulation of conversation with a psychoanalyst.
 SHRDLU: use NL to manipulate a blocks world.
 Message understanding: understand a newspaper article on terrorism.
 Machine translation.
History of Natural Language Processing
 1960s-80s
 Methods
 ATN (augmented transition networks): extended context-free grammars
 Case grammar (agent, object, etc.)
 DCG – Definite Clause Grammar
 Dependency grammar: an element depends on another
 1990s-now
 Statistical methods
 Speech recognition
 MT systems
 Question-answering
 etc…
History of Natural Language Processing

 Traditional NLP approaches: symbolic, grammar, …

 More recent approaches: statistical
 For some applications: statistical approaches are better
(tagging, speech recognition, …)

 For some others, traditional approaches are better (MT)

 Trend: combine statistics with rules (grammar)

 E.g. Probabilistic Context Free Grammar (PCFG)
 Consider some grammatical connections in statistical approaches.
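As a small illustration of combining rules with statistics, a PCFG attaches a probability to each grammar rule, and the probability of a parse tree is the product of the probabilities of the rules it uses. The grammar, its probabilities, and the tuple-based tree encoding below are invented for this sketch.

```python
# Hypothetical toy grammar: probability of each (left-hand side, right-hand side).
# Probabilities of rules sharing a left-hand side sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("the", "dog")): 0.5,
    ("NP", ("the", "man")): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("bit",)): 1.0,
}

def tree_probability(tree):
    """A tree is (label, child, ...); a leaf child is a plain string."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    probability = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            probability *= tree_probability(child)
    return probability

tree = ("S", ("NP", "the", "dog"), ("VP", ("V", "bit"), ("NP", "the", "man")))
print(tree_probability(tree))  # 0.25
```

When a sentence has several parses, a statistical parser can rank them by this probability instead of treating all grammatical parses as equally good.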

History of Natural Language Processing
 Classical symbolic methods:
 Morphological analyzer

 Parser (syntactic analysis)

 Semantic analysis (transform into a logical form, semantic network, etc.)

 Discourse analysis

 Pragmatic analysis

History of Natural Language Processing
 Empirical and Statistical Approaches
 Corpus Creation

 Treebank Annotation

 Fundamental Statistical Techniques

 Part-of-Speech Tagging

 Statistical Parsing

 Etc…
NLP Applications

 Intelligent computer systems
 NLU interfaces to databases
 Computer aided instruction
 Intelligent Web searching
 Data mining
 Machine translation
 Speech recognition
 Natural language generation
 Question answering
 Text classification
 Text categorization
 etc.

 Information retrieval
 Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items. Given a query,
the goal is to extract a subset of documents from a large data
collection that satisfies a user's information need. Besides written
texts, the database may also contain multimedia documents, e.g. audio
and video data.
 Query-answering
 Summarization
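The information retrieval task listed among these applications (given a query, return the subset of documents that satisfy it) can be sketched with an inverted index and Boolean AND matching. The documents, the function names, and the choice of Boolean retrieval are assumptions of this example; practical IR systems usually rank results rather than return an unordered set.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Invented toy collection, for illustration only.
DOCS = {
    1: "machine translation of Amharic text",
    2: "speech recognition for Amharic",
    3: "statistical machine translation",
}
index = build_index(DOCS)
print(sorted(search(index, "machine translation")))  # [1, 3]
print(sorted(search(index, "Amharic translation")))  # [1]
```

The index is built once and each query only touches the posting sets of its own terms, which is what makes this approach workable for large collections.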
NLP Applications

 Speech Synthesis
 Text to Speech:

01/02/23 26
Open Problems in NLP

 Challenges in natural language processing frequently involve:

 Natural language understanding,

 Natural language generation (frequently from formal, machine-readable logical forms),

 Connecting language and machine perception,

 Managing human-computer dialog systems, or

 Some combination thereof…

Open Problems in NLP

 Ambiguity:

 Lexical/morphological: change (V,N), training (V,N), even (ADJ, ADV) …
 Syntactic: Helicopter powered by human flies.

 Semantic: He saw a man on the hill with a telescope.

 Discourse: anaphora, …

Open Problems in NLP

 Classical solution:
 Use a later analysis step to resolve the ambiguity of an earlier step.
 E.g. He gives him the change:
 (change as a verb does not work for parsing)
 He changes the place:
 (change as a noun does not work for parsing)

 However: He saw a man on the hill with a telescope.

 Several parses are correct,
 Several semantic interpretations are correct -> semantic ambiguity,
 Use contextual information to disambiguate (does a sentence in the
text mention that “He” holds a telescope?)

Knowledge Sources

 When using NLP for a new domain, one also needs to decide
which text source should be used for extracting content.
 Of course, not every text source is suitable.
 In order to qualify as a source, the text type needs to meet the
following two criteria:
 Firstly, the text type needs to contain sufficient domain knowledge.
 In other words, if we choose a text type that only infrequently
contains content regarding a given domain, then we are not very
likely to extract any significant amount of knowledge.
 In the past, most research in NLP has been carried out on news
corpora. The topics that predominate in this text type often lie
outside the domain of interest. Consequently, this text type would be
of little value for knowledge extraction.
Knowledge Sources

 In order to qualify as a source, the text type needs to meet the
following two criteria: …
 Secondly, the text type should not only contain knowledge about
the domain that is already widely available in structured format
(such as databases).
 Otherwise, there would hardly be any point in extracting
knowledge from those texts, as it would already be available.

Computational Morphology

 What is it?
 Morphology: the study/knowledge of structure/form.
• In this case: of words,
• How words are created, structured, analyzed
• Morpheme: basic meaningful unit of language.

 Computational morphology: developing/using computer applications that involve morphology.

Computational Morphology

 Computational applications:
 Analysis: parse/break a word into its constituent morphemes.

 Generation: create/generate a word from its constituent morphemes.
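The two directions above can be sketched with a toy lexicon: `analyze` breaks a word into prefix, stem, and suffix morphemes, and `generate` concatenates morphemes back into a word. The lexicon, the tag names, and the restriction to at most one prefix and one suffix are assumptions made for this example.

```python
# Hypothetical toy lexicon of stems and affixes with illustrative tags.
STEMS = {"dog", "teach", "lock", "adjourn"}
PREFIXES = {"un": "NEG"}
SUFFIXES = {"s": "PLURAL", "er": "AGENT", "ment": "NOMINAL", "ed": "PAST"}

def analyze(word):
    """Return a list of (morpheme, tag) pairs, or None if no analysis is found."""
    for prefix in [""] + list(PREFIXES):
        for suffix in [""] + list(SUFFIXES):
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix) : len(word) - len(suffix)]
            if stem in STEMS:
                parts = [(prefix, PREFIXES[prefix])] if prefix else []
                parts.append((stem, "STEM"))
                if suffix:
                    parts.append((suffix, SUFFIXES[suffix]))
                return parts
    return None

def generate(morphemes):
    """Concatenate (morpheme, tag) pairs into a surface word."""
    return "".join(morpheme for morpheme, _ in morphemes)

print(analyze("unlocked"))
# [('un', 'NEG'), ('lock', 'STEM'), ('ed', 'PAST')]
print(generate(analyze("teacher")))  # teacher
```

Plain concatenation ignores spelling changes at morpheme boundaries, so a realistic generator also needs rules for those.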


Computational Morphology

 Morphological processes:
 Affixation: prefix, suffix, infix
 Interleaving (KaTaB, uKTaB)
 Cliticization (isn’t, s’appelle)
 Internal change: (sing/sang, goose/geese)
 Suppletion (irregularity): (aller/ir, be/am)
 Stress placement: implant, import, contest
 Tone placement: dà vs. dá (will spank vs. spanked)
 Reduplication
 Full: iji/ijiiji
 Partial: lakad/lalakad
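The two reduplication patterns in the list above can be reproduced with simple string operations. The rule used for the partial case (copy everything up to and including the first vowel) is an assumption that fits the lakad/lalakad example; it is not a general rule for every language.

```python
VOWELS = set("aeiou")

def full_reduplication(stem: str) -> str:
    """Repeat the whole stem."""
    return stem + stem

def partial_reduplication(stem: str) -> str:
    """Copy the initial consonants plus the first vowel in front of the stem."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[: i + 1] + stem
    return stem  # no vowel found: leave the stem unchanged

print(full_reduplication("iji"))       # ijiiji
print(partial_reduplication("lakad"))  # lalakad
```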

Computational Morphology

 Areas of focus in morphology:

 Derivational
 do+able, adjourn+ment, depos+ition, un+lock, teach+er

 Inflectional
 dog+s, sneez+ed

 Compounding
 overkill, BYU intramural track star

 Cliticization
 I’m, she’ll, they’ve, o’clock

Computational Morphology

 Computational morphology
 Processing morphological structure via computer (parsing, generation)
 Traditional approach:
 ad-hoc methods,
 Cut-and-paste algorithms,
 Dictionary lookup,
 Inadequate for highly inflected languages.

 Even statistical approaches are often not useful,

 Two-level approach with finite-state techniques,
 Machine learning is making inroads,
 Sequence labeling, morpheme boundary detection.
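Morpheme boundary detection can be framed as sequence labeling by giving every character a label. The sketch below assumes a two-label scheme (B = first character of a morpheme, I = inside a morpheme); the scheme and function names are choices made for this example. It shows only the encoding and decoding steps; in practice a trained model would predict the labels.

```python
def segmentation_to_labels(morphemes):
    """Encode a segmented word as one B/I label per character."""
    labels = []
    for morpheme in morphemes:
        labels.extend(["B"] + ["I"] * (len(morpheme) - 1))
    return labels

def labels_to_segmentation(word, labels):
    """Decode per-character B/I labels back into a list of morphemes."""
    morphemes = []
    for ch, label in zip(word, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)  # start a new morpheme
        else:
            morphemes[-1] += ch   # extend the current morpheme
    return morphemes

labels = segmentation_to_labels(["un", "lock"])
print(labels)                                    # ['B', 'I', 'B', 'I', 'I', 'I']
print(labels_to_segmentation("unlock", labels))  # ['un', 'lock']
```

Training data for a sequence labeler is produced with the first function, and the second turns its predictions back into segmented words.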

Question & Answer

Thank You !!!

