Tokenizer
Tokenization
Word Tokenization
In [5]: text = """There are multiple ways we can perform tokenization on given text data.
We can choose any method based on langauge, library and purpose of modeling."""
tokens = text.split()
print(tokens)
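Because `split()` cuts only on whitespace, punctuation stays attached to the neighbouring word (for example `modeling.`). The following is a minimal sketch of one way around that, assuming a simple regular expression is good enough for the task. The helper name `simple_word_tokenize` and the pattern are illustrative and are not part of the original notebook.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "We can choose any method based on language, library and purpose of modeling."

whitespace_tokens = sample.split()          # punctuation stays glued to words
regex_tokens = simple_word_tokenize(sample)  # punctuation becomes separate tokens

print(whitespace_tokens)
print(regex_tokens)
```

Comparing the two printed lists shows the difference: the whitespace version keeps `language,` and `modeling.` as single tokens, while the regex version emits the comma and the period separately.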
Sentence Tokenization
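In [12]: text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
# split() takes a single separator string; ". " is inferred from the output shown
text.split(". ")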
Out[12]: ['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method.']
Sentence Tokenization
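In [17]: import re
text = """Characters like periods, exclamation point and newline char are used to separate the sentences.But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
# the character class matches any one of '.', '!' or '?' followed by a space
tokens_sent = re.compile('[.!?] ').split(text)
tokens_sent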
Out[17]: ['Characters like periods, exclamation point and newline char are used to separate the sentences.But one drawback with split() method, that we can only use one separator at a time',
 'So sentence tonenization wont be foolproof with split() method.']
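The pattern `'[.!?] '` consumes the terminator together with the space, which is why the `!` after "at a time" is missing from the output. A minimal sketch of a variant that keeps the punctuation is shown here, assuming a lookbehind-based pattern is acceptable. The variable name `sentence_boundary` is illustrative, and this is still a heuristic that will mis-split abbreviations such as "e.g. ".

```python
import re

text = (
    "Characters like periods, exclamation point and newline char are used to "
    "separate the sentences. But one drawback with split() method, that we can "
    "only use one separator at a time! So sentence tokenization won't be "
    "foolproof with split() method."
)

# (?<=[.!?]) is a lookbehind: a terminator must precede the match but is not
# consumed, so only the whitespace after it is removed by the split.
sentence_boundary = re.compile(r"(?<=[.!?])\s+")
sentences = sentence_boundary.split(text)

for sentence in sentences:
    print(sentence)
```

Using `\s+` instead of a single space also lets the same pattern split on newlines that follow a terminator.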
word Tokenization
sentence Tokenization
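from nltk.tokenize import sent_tokenize  # requires NLTK's Punkt sentence tokenizer data

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = sent_tokenize(text)
print(tokens)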
['There are multiple ways we can perform tokenization on given text data.', 'We can choose any method based on langauge, library and purpose of modeling.']
word Tokenization
[There, are, multiple, ways, we, can, perform, tokenization, on, given, text, data, ., We, can, choose, any, method, based, on, langauge, ,, library, and, purpose, of, modeling, .]
sentence Tokenization
['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', 'So sentence tonenization wont be foolproof with split() method.']
word Tokenization
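from keras.preprocessing.text import text_to_word_sequence  # legacy Keras 2.x import path

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = text_to_word_sequence(text)
print(tokens)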
sentence Tokenization
text = """Characters like periods, exclamation point and newline char are used to s
Out[34]: ['characters like periods, exclamation point and newline char are used to separate the sentences',
 ' but one drawback with split() method, that we can only use one separator at a time',
 ' so sentence tonenization wont be foolproof with split() method']