NLP Preparing The Text Data (Part I)

This document discusses preparing text data for natural language processing (NLP) models. It explains that text needs to be converted to numerical features like word indexes or vectors before being fed into machine learning models. Common techniques for this include tokenization to split text into individual words and converting words into integers using libraries like Keras. While these work well for English, other languages like Khmer require custom approaches like using a trie data structure to handle the lack of spaces between words. The next steps will apply tries to segment Khmer text and discuss word segmentation further.

NLP: Preparing the Text Data (Part I)

Python, Machine Learning, NLP
In my previous article, Build a Trie Data Structure, I mentioned that a trie is
useful for NLP. Now, let's discuss why and how we use it to build an NLP model.
As we all know, machine learning (including deep learning) uses mathematical
operations to compute a hypothesis from numeric input variables. Therefore, we
need to convert any input data (text, image, audio, video, ...) into numbers
before we can feed it into our prepared model. For NLP, we run a text
"Tokenization" process that breaks a stream of characters into individual words
(tokens). For example, it converts the text "Framgia is awesome" into
[Framgia, is, awesome]. We can then turn the tokens into a sequence of integers
(each integer being the index of a token in a dictionary) or into vectors, so
that the text can be fed into our NLP model.
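To make the "index of a token in a dictionary" idea concrete, here is a minimal sketch written from scratch. The function name build_word_index and the choice to start indexes at 1 are assumptions made for illustration, not part of any library.

```python
def build_word_index(tokens):
    """Map each distinct token to an integer, in order of first appearance."""
    word_index = {}
    for token in tokens:
        if token not in word_index:
            # Start at 1 so that 0 stays free, e.g. for padding (an assumed convention).
            word_index[token] = len(word_index) + 1
    return word_index


tokens = "Framgia is awesome".split(" ")
word_index = build_word_index(tokens)
encoded = [word_index[t] for t in tokens]

print(word_index)  # {'Framgia': 1, 'is': 2, 'awesome': 3}
print(encoded)     # [1, 2, 3]
```

A repeated word always gets the same integer, which is the property a model needs from this kind of encoding.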
Tokenization
We can write very simple Python code to separate text into an array of words.
text = "Framgia is awesome"
words = text.split(" ")
print(words)
# ['Framgia', 'is', 'awesome']

We could also convert that array of words into a sequence of numbers from
scratch. However, there are many libraries that already do this job. We choose
to use a library instead of building our own because the code in those libraries
has been tested and used by many researchers, whereas code we write from scratch
in a short time may miss some cases. So now let's use one of those libraries:
Keras.
from keras.preprocessing.text import text_to_word_sequence
text = "Framgia is awesome"
result = text_to_word_sequence(text)
print(result)
# ['framgia', 'is', 'awesome']
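For comparison, the behaviour shown above (lowercasing, dropping punctuation, splitting on whitespace) can be approximated in a few lines of plain Python. This is only a sketch: the name simple_word_sequence is hypothetical, and the FILTERS string is an assumed approximation of the library's default filter characters.

```python
# Characters treated as separators (an assumed approximation of the Keras defaults).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS, lower=True):
    """Lowercase the text, replace filtered characters with spaces, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


print(simple_word_sequence("Framgia is awesome, Let make it awesome"))
# ['framgia', 'is', 'awesome', 'let', 'make', 'it', 'awesome']
```

Unlike a bare split(" "), this version does not leave the comma attached to "awesome".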

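We can then convert the words into an array of numbers with the Keras one_hot
or hashing_trick function.
Encoding with one_hot, which encodes a text into a list of word indexes in a
hashing space of size n. Despite its name, it returns one integer per word
rather than one-hot vectors, and because the indexes come from a hash function,
two different words can end up with the same index.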
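from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

text = "Framgia is awesome, Let make it awesome"
# Estimate the vocabulary size from the number of unique words.
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# 6
# Make the hashing space about 30% larger than the vocabulary to reduce collisions.
result = one_hot(text, round(vocab_size * 1.3))
print(result)
# [7, 6, 2, 6, 6, 6, 2]
# Example output only: one_hot relies on Python's built-in hash, so the
# indexes can differ between runs.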
Encoding with hashing_trick, which converts a text to a sequence of indexes in a
fixed-size hashing space.
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

text = "Framgia is awesome, Let make it awesome"
# Estimate the vocabulary size from the number of unique words.
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# 6
# md5 is a stable hash, so the same word maps to the same index on every run.
result = hashing_trick(text, round(vocab_size * 1.3), hash_function='md5')
print(result)
# [3, 5, 2, 4, 6, 5, 2]
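To see what a hashing scheme like this does under the hood, here is a minimal sketch using only the standard library. The helper names md5_index and hash_words are hypothetical, and the formula (hash modulo n - 1, plus 1, so that 0 is never used) reflects my understanding of how such a scheme can work rather than a guaranteed copy of the library's internals.

```python
import hashlib


def md5_index(word, n):
    """Hash a word into the range 1..n-1, leaving 0 unused."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def hash_words(words, n):
    """Convert a list of words into a list of hashed indexes."""
    return [md5_index(w, n) for w in words]


words = ["framgia", "is", "awesome", "let", "make", "it", "awesome"]
indexes = hash_words(words, 8)
print(indexes)
```

Because only n - 1 distinct indexes are available, different words can collide. Enlarging the hashing space, as the vocab_size*1.3 factor above does, makes collisions less likely but does not rule them out.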

Other languages?
All the "Tokenization" methods above work well with English text, because they
rely on spaces between words. How about other languages, for example Khmer? As
you may know, Khmer has no rule for putting a space between words, so splitting
on whitespace does not give us words. That is why we build a custom structure
(a Trie) for separating the words in this language.
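As a rough illustration of how a trie can split text that has no spaces, the sketch below uses greedy longest-match segmentation. This is one assumed approach, not necessarily the one used later in this series; the class and function names are hypothetical, and the example uses Latin letters with the spaces removed instead of real Khmer words.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Return the end index of the longest word starting at `start`, or -1."""
        node, end = self.root, -1
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def segment(text, trie):
    """Greedily cut `text` into the longest dictionary words found in the trie."""
    tokens, i = [], 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end == -1:
            end = i + 1  # unknown character: emit it on its own
        tokens.append(text[i:end])
        i = end
    return tokens


trie = Trie(["framgia", "is", "awesome", "a", "we"])
print(segment("framgiaisawesome", trie))
# ['framgia', 'is', 'awesome']
```

Greedy longest-match is simple but can pick the wrong split when a long word hides two shorter ones, which is part of why word segmentation deserves its own discussion.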
Resources
- https://machinelearningmastery.com/clean-text-machine-learning-python/
- https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- https://keras.io/preprocessing/text/
Next Step
We have learned why we need to prepare input text for NLP. We also explored
some text "Tokenization" methods and put them into practice. Moreover, we saw
why languages without spaces between words, such as Khmer, need a custom
structure like the Trie from the previous article to segment words. In the next
step, we will apply the Trie to separate Khmer text and discuss word
segmentation in more detail.
Have fun.
