NLP Notes
NLP:-
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the
interaction between computers and human languages. It enables computers to understand,
analyse, and generate natural language texts and speech.
Applications of NLP:-
NLP has many real-life applications across various domains and industries.
Chatbots: Chatbots are software programs that can converse with humans using natural
language, either through text or voice. They can provide customer service, information,
entertainment, or guidance. Some examples of chatbots are Siri, Alexa, Google Assistant,
etc.
Sentiment Analysis: Sentiment analysis is the process of identifying and extracting the
emotions, opinions, or attitudes expressed in text or speech. It can help businesses
understand customer feedback, monitor brand reputation, or analyze social media trends
(a small code sketch follows this list of applications). Some examples of sentiment
analysis tools are IBM Watson, Google Cloud Natural Language, etc.
Speech Recognition: Speech recognition is the process of converting spoken words into
text or commands. It can help people interact with devices or applications using voice, or
transcribe audio or video content. Some examples of speech recognition tools are Apple
Dictation, Google Voice, etc.
Text Summarization: Text summarization is the process of creating a short and concise
summary of a longer text document. It can help people save time, extract key information, or
generate overviews. Some examples of text summarization tools are Microsoft Word,
Google Docs, etc.
Question Answering: Question answering is the process of providing direct and accurate
answers to natural language questions. It can help people access information, solve
problems, or learn new facts. Some examples of question answering systems are IBM
Watson, Google Search, etc.
Text Generation: Text generation is the process of creating natural language texts from
data, keywords, or other inputs. It can help people write, edit, or optimize content, or
generate creative or personalized texts. Some examples of text generation tasks are content
writing, headline generation, email writing, etc.
Spell Checking: Spell checking is the process of detecting and correcting spelling errors in
text documents. It can help people improve their writing quality, avoid mistakes, or learn new
words. Some examples of spell checking tools are Grammarly, Microsoft Word, etc.
Plagiarism Detection: Plagiarism detection is the process of identifying and measuring the
similarity or originality of text documents. It can help people avoid academic dishonesty,
protect intellectual property, or cite sources. Some examples of plagiarism detection tools
are Turnitin, Copyscape, etc.
Text Clustering: Text clustering is the process of grouping text documents based on their
similarity or relevance. It can help people discover patterns, trends, or insights from text
data. Some examples of text clustering tasks are document clustering, keyword clustering,
news clustering, etc.
Keyword Extraction: Keyword extraction is the process of identifying and extracting the
most important or representative words or phrases from text documents. It can help people
summarize, index, or search text data. Some examples of keyword extraction tools are
RapidMiner, TextRazor, etc.
Text Simplification: Text simplification is the process of reducing the complexity or difficulty
of text documents while preserving their meaning or information. It can help people improve
readability, accessibility, or comprehension of text data. Some examples of text simplification
tools are Rewordify, Simple English Wikipedia, etc.
Text to Speech: Text to speech is the process of converting text into natural-sounding
speech. It can help people listen to text content, access information, or communicate with
others. Some examples of text to speech tools are Amazon Polly, Google Text-to-Speech,
etc.
Speech to Text: Speech to text is the process of converting speech into text. It can help
people transcribe audio or video content, create subtitles, or dictate text. Some examples of
speech to text tools are Google Speech-to-Text, Microsoft Speech Services, etc.
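As a small illustration of the sentiment analysis application above, here is a minimal
sketch using NLTK's rule-based VADER analyzer (a sketch under the assumption that the
nltk package is installed and the "vader_lexicon" data can be downloaded; the example
sentence is invented):

import nltk
nltk.download("vader_lexicon", quiet=True)  # lexicon used by the VADER analyzer
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this product, it works great!"))
# returns negative/neutral/positive scores plus a compound score;
# for this sentence the compound score should be clearly positive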
Key Terms:-
Corpus: A corpus is a large and structured collection of texts that are used for linguistic
analysis and NLP tasks. A corpus can be general or domain-specific, and it can contain
various types of annotations, such as part-of-speech tags, syntactic structures, semantic
roles, etc. Some examples of corpora are the Brown Corpus, the Penn Treebank, the British
National Corpus, etc.
Lemma: A lemma is the base or dictionary form of a word, which can have different inflected
forms. For example, the lemma of the words “running”, “runs”, and “ran” is “run”.
Lemmatization is the process of finding the lemma of a word based on its meaning and part
of speech.
Stem: A stem is the root or core of a word, which may not be a valid word itself. For
example, the stem of the words “running” and “runs” is “run”, but the stem of
“happiness” and “happy” is the non-word “happi”; an irregular form such as “ran” is
usually left unchanged. Stemming is the process of finding the stem of a word by
removing its affixes, such as prefixes and suffixes, without considering its meaning.
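A minimal sketch of stemming and lemmatization using NLTK's PorterStemmer and
WordNetLemmatizer (assuming nltk is installed and the WordNet data can be downloaded;
the outputs in the comments are typical for the Porter stemmer):

import nltk
nltk.download("wordnet", quiet=True)  # dictionary data needed by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "happiness", "happy"]
print([stemmer.stem(w) for w in words])
# typically ['run', 'run', 'ran', 'happi', 'happi'] -- note the non-word "happi"
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# treating each word as a verb, the irregular form "ran" is resolved to "run"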
Syntax: Syntax is the set of rules and principles that govern the structure and formation of
sentences in a language. Syntax deals with how words are arranged and combined to form
phrases, clauses, and sentences. Syntax also determines the grammatical functions and
relations of the words, such as subject, object, modifier, etc.
Semantics: Semantics is the study of the meaning and interpretation of words, phrases, and
sentences in a language. Semantics deals with how words are related to each other and to
the world, such as synonyms, antonyms, hyponyms, etc. Semantics also determines the
logical and pragmatic aspects of meaning, such as truth conditions, entailment, implicature,
etc.
Stopwords: Stopwords are words that are very common and have little or no significance or
relevance for the analysis or processing of natural language texts. Stopwords are usually
removed or filtered out from the texts to reduce the noise and improve the performance of
NLP tasks. Some examples of stopwords are “the”, “a”, “and”, “of”, etc.
Punctuators: Punctuators are symbols or marks that are used to separate or indicate the
boundaries, structure, or features of sentences, phrases, or words in a text. Punctuators can
also convey information about the tone, mood, or emphasis of the text. Some examples of
punctuators are “.”, “,”, “!”, “?”, “-”, etc.
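Since stopword and punctuator removal come up again in the text normalisation steps
later in these notes, here is a minimal sketch with NLTK (assuming the "stopwords" and
"punkt" data packages can be downloaded; the example sentence is invented):

import string
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example of a sentence, with some stopwords and punctuation!"
stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text.lower())
# keep only tokens that are neither stopwords nor punctuation marks
print([t for t in tokens if t not in stop_words and t not in string.punctuation])
# ['example', 'sentence', 'stopwords', 'punctuation']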
Script Bots and Smart Bots:-
Script bots:
These are chatbots that follow a predefined script or flowchart to respond to user inputs.
They are also known as rule-based or decision-tree chatbots. They can only handle specific
keywords or phrases that match their script, and they cannot deal with complex or
unexpected queries. They do not have any learning capabilities or artificial intelligence. They
are easy to make and integrate, but they have limited functionality and flexibility. Some
examples of script bots are:
ELIZA:- This is one of the first and most famous script bots, created in 1966 by Joseph
Weizenbaum. It simulates a psychotherapist by using pattern matching and substitution to
generate responses based on user inputs (a toy sketch of this technique follows these
examples). It does not understand the meaning or context of the conversation, but it
uses keywords and phrases to create the illusion of intelligence.
Cleverbot:- This is a popular script bot that claims to learn from previous conversations
with users. It uses a large database of pre-recorded responses to generate replies based on
user inputs. It does not have any logic or reasoning, but it uses statistical methods to choose
the most likely response. It can sometimes produce humorous or nonsensical results.
Siri:- This is a well-known script bot that acts as a personal assistant for Apple devices. It
can perform various tasks such as making calls, sending messages, setting reminders,
playing music, etc. It uses natural language understanding and speech recognition to
process user inputs, but it also relies on a set of rules and commands to execute actions. It
cannot handle complex or ambiguous queries, and it often requires the user to repeat or
rephrase their inputs.
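As promised above, here is a toy sketch of ELIZA-style pattern matching and substitution
in Python (the rules below are invented for illustration and are not Weizenbaum's
original script):

import re

# each rule pairs a regular-expression pattern with a response template
rules = [
    (r"I need (.*)", "Why do you need {0}?"),
    (r"I am (.*)", "How long have you been {0}?"),
    (r".* mother .*", "Tell me more about your mother."),
]

def respond(utterance):
    for pattern, template in rules:
        match = re.match(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please, go on."  # fallback when no pattern matches

print(respond("I am feeling lost"))  # -> How long have you been feeling lost?

Like ELIZA itself, this sketch has no understanding of the conversation; it only matches
surface patterns and echoes parts of the input back.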
Smart bots:
These are chatbots that use artificial intelligence and machine learning to respond to user
inputs. They are also known as conversational agents or intelligent assistants. They can
handle natural and varied language, and they can deal with complex or unexpected queries.
They have learning capabilities and can adapt to new situations or user preferences. They
are difficult to make and integrate, but they have wide functionality and flexibility. Some
examples of smart bots are:
Watson Assistant:- This is a powerful smart bot that uses IBM's Watson technology to
create personalised and engaging conversations with users. It can understand the intent and
context of user inputs, and it can provide relevant and accurate responses. It can also
perform actions such as booking flights, ordering food, checking weather, etc. It can learn
from user feedback and improve over time.
Google Assistant:- This is a sophisticated smart bot that acts as a personal assistant for
Google devices and services. It can perform various tasks such as searching the web,
playing games, controlling smart home devices, etc. It uses natural language understanding
and speech recognition to process user inputs, and it also uses machine learning and
knowledge graphs to generate responses. It can handle complex and ambiguous queries,
and it can also engage in casual and humorous conversations.
Replika:- This is a unique smart bot that acts as a friend and companion for users. It can
chat with users about anything, and it can also help them with their mental health and
well-being. It uses natural language processing and deep learning to process user inputs,
and it also uses neural networks and generative models to generate responses. It can learn
from user inputs and emotions, and it can also develop its own personality and style.
Use Cases of Script Bots:-
Data Analysis and Reporting: Scripts are commonly used in data analysis tasks to automate
repetitive processes like data cleaning, manipulation, and reporting. They can help in
generating reports, visualisations, and insights from large datasets more efficiently than
manual methods.
Use Cases of Smart Bots:-
Virtual Assistants: Smart bots are used as virtual assistants in various applications such as
customer service, scheduling appointments, and providing information. These bots use
natural language processing and AI algorithms to understand and respond to user queries in
real-time.
Healthcare Monitoring: Smart bots are employed in healthcare for patient monitoring and
engagement. They can collect and analyze patient data, provide medication reminders, offer
health tips, and even detect potential health issues based on user input or sensor data.
Stemming vs Lemmatization:-
Stemming: Stemming is the process of removing the suffixes of words to get their stem,
which may not be a valid word itself. For example, the stem of “running” and “runs” is
“run”, while the stem of “happiness” and “happy” is the non-word “happi”; the irregular
form “ran” is usually left unchanged. As a result, stemming may return strings that are
not actual words or that change the meaning of the original word.
Speed: Stemming is usually faster than lemmatization, as it simply chops off the ends of
words without looking at the context or the dictionary. Lemmatization, on the other hand,
requires more processing and resources to find the correct lemma of a word.
Accuracy: Lemmatization is usually more accurate than stemming, as it returns valid words
that match the part of speech and the meaning of the original word. Stemming, on the other
hand, may return words that are not actual words or that change the meaning of the original
word.
Application: Depending on the goal and the domain of your natural language processing
task, you may prefer stemming or lemmatization. For example, if you are doing text
classification or sentiment analysis, you may want to use lemmatization to preserve the
semantic information of the words. If you are doing information retrieval or keyword
extraction, you may want to use stemming to increase the recall of your search results.
There is no definitive answer to when to use stemming and when to use lemmatization, as it
depends on your specific needs and preferences.
NLP Pipeline (Building a Question-Answering Chatbot):-
Preprocessing: You would collect a large amount of text data from various sources, such
as books, articles, blogs, or forums, that contain information about NLP. You would then
clean and prepare the data by removing punctuation, stopwords, numbers, and other
irrelevant or noisy information. You would also label the data with the corresponding
questions and answers, such as “What is NLP?” and “NLP stands for Natural Language
Processing.”
Tokenization: You would break the data into smaller units, such as words or sentences. You
would split the text into a list of words, such as [“NLP”, “stands”, “for”, “Natural”, “Language”,
“Processing”], or into a list of sentences, such as [“NLP stands for Natural Language
Processing.”, “It is the branch of Artificial Intelligence that gives machines the
ability to understand and process human languages.”]
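A minimal tokenization sketch with NLTK (assuming the "punkt" tokenizer data can be
downloaded; the text is the example sentence above):

import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("NLP stands for Natural Language Processing. It is the branch of "
        "Artificial Intelligence that gives machines the ability to "
        "understand and process human languages.")
print(sent_tokenize(text))  # a list of two sentences
print(word_tokenize(text))  # a list of word and punctuation tokens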
Feature extraction: You would transform the data into a numerical representation that can
be used by machine learning algorithms. You would create a bag-of-words or a word
embedding for the text, which are ways of representing the words as vectors of numbers.
For example, a bag-of-words would assign a unique index to each word and count its
frequency in the text, such as {“NLP”: 1, “stands”: 1, “for”: 1, “Natural”: 1, “Language”: 1,
“Processing”: 1}. A word embedding would assign a vector of numbers to each word that
captures its meaning and context, such as {“NLP”: [0.2, -0.3, 0.4, …], “stands”: [-0.1, 0.5,
-0.2, …], “for”: [0.3, -0.4, 0.1, …], and so on}.
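A minimal bag-of-words sketch using scikit-learn's CountVectorizer (the two example
sentences are invented for illustration; word embeddings would instead come from a
library such as gensim or spaCy):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["NLP stands for Natural Language Processing",
        "NLP helps machines process human language"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # documents -> matrix of word counts
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one row of counts per document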
Modelling: You would apply machine learning algorithms to the data to perform the desired
task. You would use a classifier to predict the answer to a given question, or a neural
network to generate a response to a given input. For example, a classifier would take a
question as input, such as “What is NLP?”, and output the most likely answer, such as “NLP
stands for Natural Language Processing.” A neural network would take a user message as input,
such as “Hi, I want to learn NLP.”, and output a response, such as “Hello, this is Zing. I can
help you with learning NLP.”
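A minimal sketch of the classifier variant, pairing bag-of-words features with a Naive
Bayes model (the tiny training set and its intent labels are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

questions = ["What is NLP?", "Define NLP",
             "How do I learn NLP?", "Where can I study NLP?"]
intents = ["definition", "definition", "learning", "learning"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(questions, intents)                   # learn word counts per intent
print(model.predict(["What does NLP mean?"]))   # likely ['definition']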
Evaluation: You would measure the performance and accuracy of the model on the data.
You would use metrics such as precision, recall, and F1-score to evaluate the model’s
output. For example, precision would measure how many of the answers or responses
generated by the model are correct, recall would measure how many of the correct
answers or responses are generated by the model, and the F1-score, the harmonic mean of
the two, would measure the balance between precision and recall.
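A minimal sketch of these metrics with scikit-learn (the true and predicted labels are
invented for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["definition", "learning", "definition", "learning", "definition"]
y_pred = ["definition", "learning", "learning", "learning", "definition"]

# treat "definition" as the positive class for the binary scores
print(precision_score(y_true, y_pred, pos_label="definition"))  # 1.0
print(recall_score(y_true, y_pred, pos_label="definition"))     # 2/3, about 0.67
print(f1_score(y_true, y_pred, pos_label="definition"))         # 0.8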
TEXT NORMALISATION
Original Text:-
I guess we could discuss the implications of the phrase “meant to be.” That is if we
wanted to drown ourselves in a sea of backwardly referential semantics and other
mumbo-jumbo. Maybe such a discussion would result in the determination that “meant to
be” is exactly as meaningless a phrase as it seems to be, and that none of us is
actually meant to be doing anything at all. But that’s my existential underpants
underpinnings showing. It’s the way the cookie crumbles. And now I want a cookie.
Sentence Segmentation:- This step involves breaking up the text into sentences. A
sentence is usually defined as a unit of text that ends with a period, question mark, or
exclamation mark.
Output:-
I guess we could discuss the implications of the phrase “meant to be.”
That is if we wanted to drown ourselves in a sea of backwardly referential semantics and other mumbo-jumbo.
Maybe such a discussion would result in the determination that “meant to be” is exactly as meaningless a phrase as it seems to be, and that none of us is actually meant to be doing anything at all.
But that’s my existential underpants underpinnings showing.
It’s the way the cookie crumbles.
And now I want a cookie.
Tokenization:- This step involves breaking up each sentence into words or tokens. A
token is usually defined as a unit of text that is separated by spaces or punctuation
marks.
Output:-
['I', 'guess', 'we', 'could', 'discuss', 'the', 'implications', 'of', 'the', 'phrase',
'"meant', 'to', 'be', '."', 'That', 'is', 'if', 'we', 'wanted', 'to', 'drown',
'ourselves', 'in', 'a', 'sea', 'of', 'backwardly', 'referential', 'semantics', 'and',
'other', 'mumbo-jumbo', '.', 'Maybe', 'such', 'a', 'discussion', 'would', 'result',
'in', 'the', 'determination', 'that', '"meant', 'to', 'be', '"', 'is', 'exactly', 'as',
'meaningless', 'a', 'phrase', 'as', 'it', 'seems', 'to', 'be', ',', 'and', 'that',
'none', 'of', 'us', 'is', 'actually', 'meant', 'to', 'be', 'doing', 'anything', 'at',
'all', '.', 'But', 'that', "'s", 'my', 'existential', 'underpants', 'underpinnings',
'showing', '.', 'It', "'s", 'the', 'way', 'the', 'cookie', 'crumbles', '.', 'And',
'now', 'I', 'want', 'a', 'cookie', '.']
Removing Stopwords, punctuators and numbers:- This step involves removing words
that have little meaning or relevance, such as articles, conjunctions, and prepositions.
It also involves removing punctuation marks and numbers that are not essential for
the text analysis.
Output:-
['guess', 'could', 'discuss', 'implications', 'phrase', 'meant', 'wanted', 'drown',
'sea', 'backwardly', 'referential', 'semantics', 'mumbo-jumbo', 'Maybe', 'discussion',
'result', 'determination', 'meant', 'exactly', 'meaningless', 'phrase', 'seems',
'none', 'actually', 'meant', 'anything', 'But', 'existential', 'underpants',
'underpinnings', 'showing', 'way', 'cookie', 'crumbles', 'want', 'cookie']
Common Case Conversion:- This step involves converting all the letters of the text to
one case, either lowercase or uppercase, to standardise the text. This step is useful
when working with text data that contains a mix of uppercase and lowercase letters.
Output:-
['guess', 'could', 'discuss', 'implications', 'phrase', 'meant', 'wanted', 'drown',
'sea', 'backwardly', 'referential', 'semantics', 'mumbo-jumbo', 'maybe', 'discussion',
'result', 'determination', 'meant', 'exactly', 'meaningless', 'phrase', 'seems',
'none', 'actually', 'meant', 'anything', 'but', 'existential', 'underpants',
'underpinnings', 'showing', 'way', 'cookie', 'crumbles', 'want', 'cookie']
Stemming and lemmatization: These steps involve reducing the words to their root
forms, either by removing the suffixes (stemming) or by using a dictionary or a
linguistic rule (lemmatization). These steps are useful when working with text data
that contains different forms of the same word, such as singular and plural, or tense
and aspect.
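Putting the steps above together, here is a minimal end-to-end normalisation sketch with
NLTK (assuming the "punkt", "stopwords", and "wordnet" data packages can be downloaded;
the input echoes the example text above):

import string
import nltk
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I guess we could discuss the implications of the phrase."

sentences = sent_tokenize(text)                            # sentence segmentation
tokens = [t for s in sentences for t in word_tokenize(s)]  # tokenization
tokens = [t.lower() for t in tokens]                       # common case conversion
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens                                # remove stopwords and
          if t not in stop_words and t not in string.punctuation]  # punctuators

print([PorterStemmer().stem(t) for t in tokens])
# stems may not be real words, e.g. 'implications' -> 'implic'
print([WordNetLemmatizer().lemmatize(t) for t in tokens])
# lemmas are dictionary forms, e.g. 'implications' -> 'implication'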