clj-tokenizer

A simple Clojure wrapper around the Lucene text tokenizer. A wrapper for the Lucene StandardAnalyzer and Lucene StandardTokenizer are provided.

For a proper Clojure library for NLP see clojure-nlp.

The project can run from the command line and will tokenize each line of stdin, remove stopwords and write to stdout.

Usage

First clone the project. Then set up your

  lein deps
  lein compile; lein uberjar

For example, to use the tokenizer from the command line use java -jar

  curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt | java -jar clj-tokenizer-0.1.0-SNAPSHOT-standalone.jar | head -100

will tokenize Herman Melville's Moby Dick.

To use the tokenizer within Clojure first add the dependency to project.clj

  [clj-tokenizer "0.1.0"]

To create a token stream:

  (token-seq (token-stream "This is a string."))
  ;; ("This" "is" "a" "string")

To convert to lowercase and remove stopwords:

  (token-seq (token-stream-without-stopwords "This is a string, without the stopwords."))
  ;; ("string" "without" "stopwords")

To stem the words using the Snowball stemmer:

  (token-seq (stemmed (token-stream-without-stopwords "Going to be Stemming some lemmings.")))
  ;; ("go" "stem" "some" "lem")

Distributed under the Eclipse Public License, the same as Clojure.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
src/clj_tokenizer		src/clj_tokenizer
test/clj_tokenizer/test		test/clj_tokenizer/test
.gitignore		.gitignore
README.md		README.md
project.clj		project.clj