Large-scale 4D-parallelism pre-training of Mixture-of-Experts models with 🤗 Transformers *(still a work in progress)*
transformers
moe
data-parallelism
distributed-optimizers
model-parallelism
megatron
mixture-of-experts
pipeline-parallelism
huggingface-transformers
megatron-lm
tensor-parallelism
large-scale-language-modeling
3d-parallelism
zero-1
sequence-parallelism
Updated Dec 14, 2023 · Python