Stars
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Every front-end GUI client for ChatGPT, Claude, and other LLMs
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment…
This repository contains integer operators on GPUs for PyTorch.
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
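As a rough illustration of what such a BPE implementation looks like, here is a minimal training-loop sketch (function names are my own, not taken from any of the repositories above): start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent token pair into a new token id.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent token pair in the sequence."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Train: byte-level start (ids 0-255), then learn merges greedily.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))
merges = {}
for new_id in range(256, 256 + 3):       # 3 merges for the demo
    pair = most_frequent_pair(ids)
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)
```

Encoding new text then just replays the learned `merges` in order; real tokenizers add regex pre-splitting and special tokens on top of this core loop.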
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & V…
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
A framework for few-shot evaluation of language models.
📰 Must-read papers and blogs on Speculative Decoding ⚡️
FlashInfer: Kernel Library for LLM Serving
Measuring Massive Multitask Language Understanding | ICLR 2021
AutoAWQ implements the AWQ algorithm for 4-bit quantization, achieving a 2x speedup during inference.
SGLang is a fast serving framework for large language models and vision language models.
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
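For intuition on what 4-bit weight quantization involves, here is a plain round-to-nearest, per-group asymmetric INT4 sketch (my own minimal version; AWQ's actual contribution is the activation-aware rescaling of salient channels that happens *before* this step, which is omitted here):

```python
import numpy as np

def quantize_int4_per_group(w, group_size=128):
    """Round-to-nearest asymmetric INT4 quantization, one scale/zero
    point per group of `group_size` weights. Returns uint8 codes in
    [0, 15] plus the per-group scale and zero point."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Map INT4 codes back to approximate floating-point weights."""
    return (q.astype(np.float32) - zero) * scale
```

Smaller groups give tighter scales (lower error) at the cost of more metadata; 128 is a common group size in 4-bit LLM checkpoints.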
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Fast and memory-efficient exact attention
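The core numerical trick behind memory-efficient exact attention is the online softmax: a running max and running normalizer let scores be processed block by block without ever materializing the full softmax. A minimal one-row sketch (my own illustration, not the repository's CUDA implementation):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Compute softmax(scores) @ values in a single streaming pass.

    Maintains a running max `m` and normalizer `l`; each new score
    rescales the accumulator so the result is exact, not approximate.
    """
    m = -np.inf                                   # running max
    l = 0.0                                       # running normalizer
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        correction = np.exp(m - m_new)            # rescale old state
        p = np.exp(s - m_new)
        l = l * correction + p
        acc = acc * correction + p * v
        m = m_new
    return acc / l
```

FlashAttention applies this per block of keys/values inside GPU SRAM, which is why it is both exact and memory-efficient.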
A high-throughput and memory-efficient inference and serving engine for LLMs
OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Documentation for Colossal-AI
Making large AI models cheaper, faster and more accessible
A collection of AWESOME things about HUGE AI models.