text-clustering

Install

pip install scikit-learn umap-learn sentence_transformers faiss-cpu plotly matplotlib datasets

Usage

Run pipeline and visualize results:

from src.text_clustering import ClusterClassifier
from datasets import load_dataset

SAMPLE = 100_000

texts = load_dataset("HuggingFaceFW/FW-12-12-2023-CC-2023-06", split="train").select(range(SAMPLE))["content"]

cc = ClusterClassifier(embed_device="mps")

# run the pipeline:
embs, labels, summaries = cc.fit(texts)

# show the results
cc.show()

# save 
cc.save("./cc_100k")

Load classifier and run inference:

from src.text_clustering import ClusterClassifier

cc = ClusterClassifier(embed_device="mps")

# load state
cc.load("./cc_100k")

# visualize
cc.show()

# classify new texts with k-nearest neighbour search
cluster_labels, embeddings = cc.infer(some_texts, top_k=1)

You can also run the pipeline using a script with:

# run a new pipeline
python run_pipeline.py --mode run  --save_load_path './cc_100k' --n_samples 100000 --build_hf_ds
# load existing pipeline
python run_pipeline.py --mode load --save_load_path './cc_100k' --build_hf_ds
# inference mode on new texts from an input dataset
python run_pipeline.py --mode infer --save_load_path './cc_100k'  --n_samples <NB_INFERENCE_SAMPLES> --input_dataset <HF_DATA_FOR_INFERENCE>

The build_hf_ds flag builds and pushes HF datasets, for the files and clusters, that can be directly used in the FW visualization space. In infer mode, we push the clusters dataset by default.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
plots		plots
src		src
README.md		README.md
run_pipeline.py		run_pipeline.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

text-clustering

Install

Usage

About

Releases

Packages

Contributors 2

Languages

License

huggingface/text-clustering

Folders and files

Latest commit

History

Repository files navigation

text-clustering

Install

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages