NLP Preparing The Text Data (Part I)
NLP Preparing The Text Data (Part I)
NLP Preparing The Text Data (Part I)
PythonMachine LearningNLP
In my previous article, I talk about Build a Trie Data structure where I mentioned that
it's usefull for NLP. Now, let's discause why/how we use it to build NLP model.
As we all know, machine learning(deep learning) is a process of using mathematics
operation to calculate then hypothesis from given input variables(numbers) via some
equations. Therefore, we need to convert any input data(text, image, audio, video,..)
into a form of numbers then we can fit those number into our prepared model. For
NLP, we do a text "Tokenization" process to breaks up a stream of characters into
individual words(tokens). For example, it would convert the text "Framgia is
awesome" into [Framgia, is, awesome]. Then we can turn each of those
words(tokens) into a sequence of integers (each integer being the index of a token in
a dictionary) or into a vector. So that, it can be fited into our NLP model.
Tokenization
We can write a very simple python code to seperate text into array of words with
ease.
str = "Framgia is awesome"
str.split(" ")
# ['Framgia', 'is', 'awesome']
And we also can convert those array of words into a sequence of numbers from
scratch. However, there are many libraries where we can use to do this job. We
choose to use library instead of building our own because the code in those libraries
had been tested and used by many researchers. Where our scratch code may miss
some points since we write it in a short of time. So now let's use one of those
libraries which is Keras.
from keras.preprocessing.text import text_to_word_sequence
text = "Framgia is awesome"
result = text_to_word_sequence(text)
print(result)
# ['framgia', 'is', 'awesome']
Other languages?
All "Tokenization" methods above really work well with English text. How about other
langauges? Ex: Khmer langauge. As you may know in Khmer langauge there isn't rule
for adding space for seperating word. That why we build a custom structure(Trie) for
separating the word in this language.
Resources
https://machinelearningmastery.com/clean-text-machine-learning-python/
https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
https://keras.io/preprocessing/text/
Next Step
We have learned about why we need to prepare input text for NLP. We also explore
some text "Tokenization" methods and put them into practice. Moreover, we also use
the code from the previous article to segment word for some special language such
as Khmer. Next step, we will apply Trie for separating the Khmer language and we
will discuss more about word segmentation.
Have fun.