
Text Wrangling and Cleansing
The previous chapter was all about getting a head start on Python as well as NLTK,
and we learned how to do some meaningful EDA on any corpus of text. We did all
the pre-processing in a very crude and simple manner.
In this chapter, we will go over pre-processing steps such as tokenization, stemming,
lemmatization, and stop word removal in more detail. We will explore all the tools
in NLTK for text wrangling. We will talk about the pre-processing steps used in
modern NLP applications, the different ways to achieve some of these tasks, as well
as the general do's and don'ts. The idea is to give you enough information about
these tools so that you can decide what kind of pre-processing tool you need for
your application. By the end of this chapter, you should know:

• What data wrangling is, and how to perform it using NLTK


• Why text cleansing is important, and the common cleansing tasks that
can be achieved using NLTK

What is text wrangling?


It's really hard to define the term text/data wrangling. I will define it as all the
pre-processing and all the heavy lifting you do to get machine-readable, formatted
text from raw data. The process involves data munging, text cleansing, specific
pre-processing, tokenization, stemming or lemmatization, and stop word removal.
Let's start with a basic example of parsing a CSV file:
>>>import csv
>>>with open('example.csv', 'rb') as f:
>>>    reader = csv.reader(f, delimiter=',', quotechar='"')
>>>    for line in reader:
>>>        print line[1]    # assuming the second field is the raw string


Here, we are parsing a CSV file; in the preceding code, line will be a list of all the
column values in the current row. We can customize this to work with any delimiter
and quoting character. Once we have the raw string, we can apply the different kinds
of text wrangling that we learned in the last chapter. The point here is to equip you
with enough detail to deal with day-to-day CSV files.

A clear process flow for some of the most commonly accepted document types is
shown in the following block diagram:

[Block diagram: the first row shows common data sources (CSV, HTML, XML, databases, JSON, PDF, and NoSQL stores); the second row shows the corresponding Python parsers (the csv module, HTMLParser, SAX/DOM/XMLParser, pyodbc, the json module, PDFMiner, and NoSQL client libraries); these produce raw text, which then flows through tokenization, text cleansing, stop word removal, and stemming / lemmatization.]

I have listed the most common data sources in the first row of the diagram. In most
cases, the data will reside in one of these formats. In the second row, I have listed
the most commonly used Python wrappers around those data formats. For example,
in the case of a CSV file, Python's csv module is the most robust way of handling it.
It allows you to play with different delimiters, different quote characters, and so on.
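For instance, if the file were tab-separated instead, only the delimiter needs to change. The following is a minimal sketch along the same lines; the example.tsv file name is just an assumption for illustration:
>>>import csv
>>>with open('example.tsv', 'rb') as f:
>>>    reader = csv.reader(f, delimiter='\t', quotechar='"')
>>>    for line in reader:
>>>        print line    # each line is a list of the column values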

The other most commonly used file format is JSON.


For example, a JSON file looks like this:

{
    "array": [1, 2, 3, 4],
    "boolean": true,
    "object": {
        "a": "b"
    },
    "string": "Hello World"
}

Let's say we want to process the string field. The parsing code will be:
>>>import json
>>>jsonfile = open('example.json')
>>>data = json.load(jsonfile)
>>>print data['string']
Hello World

Here, we are simply loading a JSON file using the json module; we can then pick the
fields we need and process them in their raw string form. Please have a look at the
diagram to get more details about all the data sources and their parsing packages in
Python. I have only given pointers here; please feel free to search the web for more
details about these packages.

So, before you write your own parser for these different document formats, please
have a look at the second row of the diagram for the parsers already available in
Python. Once you have a raw string, all the pre-processing steps can be applied as a
pipeline, or you might choose to ignore some of them. We will talk about tokenization,
stemmers, and lemmatizers in detail in the next section. We will also talk about their
variants, and when to use one over the other.
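As a quick illustration of using one of these ready-made parsers, the following minimal sketch uses Python's built-in xml.etree.ElementTree module to pull the raw text out of specific elements of an XML file; the example.xml file name and the para tag are assumptions made for this example:
>>>from xml.etree import ElementTree
>>>tree = ElementTree.parse('example.xml')
>>>for node in tree.iter('para'):    # 'para' is a made-up element name
>>>    print node.text               # raw text content of each matching element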

Now that you have an idea of what text wrangling is, try to
connect to any one of the databases using one of the Python
modules described in the preceding image.
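If you don't have an ODBC source set up for pyodbc, the following is a minimal sketch in the same spirit using Python's built-in sqlite3 module; the database file, table, and column names here are made up purely for illustration:
>>>import sqlite3
>>>conn = sqlite3.connect('example.db')            # hypothetical local database file
>>>cursor = conn.cursor()
>>>cursor.execute("SELECT body FROM documents")    # assumed table and column names
>>>for (body,) in cursor.fetchall():
>>>    print body                                  # raw text, ready for the pipeline
>>>conn.close()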


Text cleansing
Once we have parsed the text from a variety of data sources, the challenge is to
make sense of this raw data. Text cleansing is a term loosely used for most of the
cleaning to be done on text, depending on the data source, parsing performance,
external noise, and so on. In that sense, what we did in Chapter 1, Introduction to
Natural Language Processing, to clean the HTML using clean_html can be labeled as
text cleansing. In another case, where we are parsing a PDF, there could be unwanted
noisy characters or non-ASCII characters to be removed, and so on. Before going on
to the next steps, we want to remove these to get clean text to process further. With
a data source like XML, we might only be interested in some specific elements of the
tree; with databases, we may have to manipulate splitters, and sometimes we are only
interested in specific columns. In summary, any process that is done with the aim of
making the text cleaner and removing the noise surrounding it can be termed text
cleansing. There are no clear boundaries between the terms data munging, text
cleansing, and data wrangling; they can be used interchangeably in a similar context.
In the next few sections, we will talk about some of the most common pre-processing
steps for any NLP task.
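As a small sketch of what such a cleanup step might look like, the following strips non-ASCII characters and collapses stray whitespace from a raw string; the noisy sample string is made up to mimic typical PDF parser output:
>>>import re
>>>raw = 'Price: \xe2\x82\xac20  \x0c  see   page 3'     # made-up noisy parser output
>>>clean = ''.join(ch for ch in raw if ord(ch) < 128)    # drop non-ASCII bytes
>>>clean = re.sub(r'\s+', ' ', clean).strip()            # collapse whitespace and form feeds
>>>print clean
Price: 20 see page 3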

Sentence splitter
Some NLP applications require splitting a large raw text into sentences to get more
meaningful information out. Intuitively, a sentence is an acceptable unit of
conversation; for a computer, though, it is a harder task than it looks. A typical
sentence splitter can range from something as simple as splitting the string on a
period (.) to something as complex as a predictive classifier that identifies sentence
boundaries:
>>>inputstring = ' This is an example sent. The sentence splitter will split on sent markers. Ohh really !!'
>>>from nltk.tokenize import sent_tokenize
>>>all_sent = sent_tokenize(inputstring)
>>>print all_sent
[' This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !!']

We are trying to split the raw text string into a list of sentences. The preceding
function, sent_tokenize, internally uses a sentence boundary detection algorithm
that comes pre-built into NLTK. If your application requires a custom sentence
splitter, there are ways that we can train a sentence splitter of our own:
>>>import nltk.tokenize.punkt
>>>tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()


The preceding sentence splitter is available for 17 languages; you just need to load
the respective pickle object. In my experience, the default splitter is good enough to
deal with a wide variety of text corpora, and it is unlikely that you will have to build
your own.
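As a sketch of how that looks, assuming the punkt models have already been downloaded via nltk.download(), you can load a language-specific pickle directly:
>>>import nltk.data
>>>spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>>spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']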

Tokenization
A word (token) is the minimal unit that a machine can understand and process, so
no text string can be processed further without going through tokenization.
Tokenization is the process of splitting a raw string into meaningful tokens. The
complexity of tokenization varies according to the needs of the NLP application and
the complexity of the language itself. For example, in English it can be as simple as
picking out words and numbers with a regular expression, but for Chinese and
Japanese it will be a very complex task.
>>>s = "Hi Everyone ! hola gr8" # simplest tokenizer
>>>print s.split()
['Hi', 'Everyone', '!', 'hola', 'gr8']
>>>from nltk.tokenize import word_tokenize
>>>word_tokenize(s)
['Hi', 'Everyone', '!', 'hola', 'gr8']
>>>from nltk.tokenize import regexp_tokenize, wordpunct_tokenize,
blankline_tokenize
>>>regexp_tokenize(s, pattern='\w+')
['Hi', 'Everyone', 'hola', 'gr8']
>>>regexp_tokenize(s, pattern='\d+')
['8']
>>>wordpunct_tokenize(s)
['Hi', ',', 'Everyone', '!!', 'hola', 'gr8']
>>>blankline_tokenize(s)
['Hi, Everyone !! hola gr8']

In the preceding code, we have used various tokenizers. To start with, we used the
simplest: the split() method of Python strings. This is the most basic tokenizer; it
uses white space as the delimiter. The split() method can also be given a custom
separator for slightly more complex tokenization. In the preceding example, you will
hardly find any difference between the outputs of s.split() and word_tokenize().
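For instance, split() accepts an arbitrary separator, and regexp_tokenize accepts any pattern you like; the following is a small sketch of both, with made-up sample strings:
>>>print '2015-05-01'.split('-')    # split() with a custom separator
['2015', '05', '01']
>>>from nltk.tokenize import regexp_tokenize
>>>regexp_tokenize("Can't is a contraction", pattern="[a-zA-Z']+")
["Can't", 'is', 'a', 'contraction']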
