All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- speech mining with SpeechMatrix
- ALTI+
- BLASER
- many tests for the mining pipeline and different modules of stopes
- the `Launcher` can now retry jobs when running on a flaky Slurm cluster
- different margin implementations in mining
- possibility to take the best neighbour when running the margin instead of the first one (fast)
- mine large datasets by splitting them in sub-languages
- when mining, keep metadata about what pairs come from the forward and backward pass
- when mining, choose if you want to do only forward, backward or both passes
- embeddings for mining are now stored in real npy files with headers
- `StopesModule` is not `async` anymore, just the APIs of `Launcher`. You should write your `run` function as a normal non-async function
- mining neighbours is now optimized to have a smaller memory load
- progress bar of pipelines is simplified to avoid overly busy logs
- do not rely on existing line count files and compute them as part of the pipeline in the mining
- many improvements in the mining code
- many fixes in the NMT eval pipeline
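
For the `StopesModule` entry above, a migration might look roughly like the sketch below. `CountLinesModule`, its `run` signature, and `schedule` are hypothetical stand-ins and not the stopes API; the only thing taken from the entry is the idea that `run` becomes a plain function while the launcher-level API stays async.

```python
import asyncio


class CountLinesModule:
    """Hypothetical module; a real one would subclass StopesModule."""

    def __init__(self, lines):
        self.lines = lines

    # Before this change, this would have been declared `async def run(...)`.
    def run(self):
        # Plain synchronous body: nothing is awaited here.
        return len(self.lines)


async def schedule(module):
    """Hypothetical stand-in for an async launcher API."""
    # Run the synchronous `run` in a worker thread so the event loop stays free.
    return await asyncio.to_thread(module.run)


result = asyncio.run(schedule(CountLinesModule(["a", "b", "c"])))
print(result)
```

The caller still awaits the launcher-style coroutine, but the module author no longer writes `async` or `await` inside `run`. `asyncio.to_thread` needs Python 3.9 or later.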
Initial release