quntoken

New Hungarian tokenizer based on quex and huntoken. This tool is also integrated into the e-magyar language processing system under the name emToken.

Requirements

OS: linux x86-64
python 3.6+

Developer requirements:

python 2.7 (for quex)
g++ = 5

Install

pip3 install quntoken

Usage

Command line

quntoken reads plain text in UTF-8 from STDIN and writes to STDOUT.

The default (and recommended) output format is TSV with two columns: the first contains the token, the second contains the whitespace sequence that follows the token. Sentence boundaries are marked with empty lines.

Example: tokenize input.txt and write the TSV output to output.tsv.

quntoken <input.txt >output.tsv
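The TSV layout described above (token, tab, following whitespace, with an empty line between sentences) can be read back with a few lines of Python. This is a minimal sketch, not part of quntoken: `read_sentences` is a hypothetical helper, and the second-column values in `sample` are placeholders, since how quntoken actually encodes the whitespace column is not specified here. The sketch treats that column as an opaque string and does not handle header or comment lines.

```python
def read_sentences(lines):
    """Group two-column TSV lines into sentences.

    Yields one list of (token, ws_after) pairs per sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            # An empty line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the first tab only; the rest is kept as-is.
        token, _, ws_after = line.partition('\t')
        sentence.append((token, ws_after))
    if sentence:
        # Emit the last sentence if the input lacks a trailing empty line.
        yield sentence


# Hypothetical sample; the whitespace column values are illustrative only.
sample = [
    'Hello\t" "\n',
    'world\t""\n',
    '.\t"\\n"\n',
    '\n',
    'Bye\t""\n',
    '.\t""\n',
]
sentences = list(read_sentences(sample))
for sent in sentences:
    print([token for token, _ in sent])
```

With a real file, `read_sentences(open('output.tsv', encoding='utf-8'))` would be the corresponding call.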

Optional arguments:

  -h, --help            show this help message and exit
  -f {json,raw,spl,tsv,xml}, --form {json,raw,spl,tsv,xml}
                        Valid formats: json, tsv, xml and spl (sentence per
                        line, ignores mode). Default format: tsv.
  -m {sentence,token}, --mode {sentence,token}
                        Modes: sentence or token (does not apply for
                        form=spl). Default: token
  -c, --conll-text      Add CoNLL text metafield to contain the detokenized
                        sentence (only for mode == token and format == tsv).
                        Default: False
  -w, --word-break      Eliminate word break from end of lines.
  -v, --version         show program's version number and exit
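When the command-line tool has to be driven from a script, the options listed above can be assembled programmatically. The following is a sketch under stated assumptions: `quntoken_cmd` and `run_quntoken` are hypothetical helpers (not part of the package), and the `quntoken` executable is assumed to be on `PATH`. Only the `-f`, `-m`, `-c` and `-w` flags from the help text are used.

```python
import subprocess


def quntoken_cmd(form='tsv', mode='token', conll_text=False, word_break=False):
    """Build the argument list for the quntoken command-line tool."""
    cmd = ['quntoken', '-f', form, '-m', mode]
    if conll_text:
        cmd.append('-c')
    if word_break:
        cmd.append('-w')
    return cmd


def run_quntoken(text, **options):
    """Send text to quntoken on STDIN and return what it writes to STDOUT.

    Assumes the quntoken executable is installed and on PATH.
    """
    result = subprocess.run(
        quntoken_cmd(**options),
        input=text,
        stdout=subprocess.PIPE,
        encoding='utf-8',  # quntoken reads and writes UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_quntoken(text, form='spl')` would correspond to `quntoken -f spl -m token` reading `text` from STDIN.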

Python API

quntoken.tokenize(inp=sys.stdin, form='tsv', mode='token', word_break=False, conll_text=False)

Entry point, returns an iterator object. Parameters:

inp: Input iterator, default: sys.stdin.

form: Output format. Valid values: 'tsv' (default), 'json', 'xml' and 'spl' (sentence per line; ignores mode).

mode: 'sentence' (sentence segmentation only) or 'token' (full tokenization, the default). Does not apply when form='spl'.

word_break: If True, removes word breaks at the ends of lines. Default: False.

conll_text: If True, adds a CoNLL text metafield containing the detokenized sentence (only when mode='token' and form='tsv'). Default: False.
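The parameters above can be combined with an in-memory input instead of a file. The sketch below is illustrative only: `split_sentences` is a hypothetical helper, and it assumes that `tokenize` accepts any iterator over lines (here an `io.StringIO`) and yields strings, which matches the description of `inp` but has not been verified against the implementation.

```python
import io


def split_sentences(text):
    """Return the sentence-per-line ('spl') output for a string.

    Assumption: tokenize() accepts any line iterator and yields strings.
    """
    # Imported lazily so this module can be loaded without quntoken installed.
    from quntoken import tokenize

    # io.StringIO iterates over the text line by line, like an open file.
    return ''.join(tokenize(io.StringIO(text), form='spl'))
```

Since form='spl' ignores mode, only the form argument is passed here.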

Example:

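from quntoken import tokenize

# Open the input explicitly as UTF-8 and close it when done.
with open('input.txt', encoding='utf-8') as inp:
    for tok in tokenize(inp):
        # end='' avoids adding an extra newline after each yielded item.
        print(tok, end='')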
