Stars
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrained models and a diffusion model toolbox. Equipped with high …
💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
[ICLR 2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
How to optimize some common algorithms in CUDA.
Official inference repo for FLUX.1 models
📝 A simple and elegant markdown editor, available for Linux, macOS and Windows.
Aims to integrate most existing feature-caching-based diffusion acceleration schemes into a unified framework.
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
📚 Collection of awesome generation acceleration resources.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Distributed Compiler based on Triton for Parallel Systems
Efficient vision foundation models for high-resolution generation and perception.
A Unified Cache Acceleration Framework for 🤗Diffusers: Qwen-Image-Lightning, Qwen-Image, HunyuanImage, Wan, FLUX, etc.
Pruna is a model optimization framework built for developers, enabling you to deliver faster, more efficient models with minimal overhead.
💥 EasyCaching is an open source caching library that covers both basic and advanced caching usages, helping you handle caching more easily!
Faster generation with text-to-image diffusion models.
Fast and memory-efficient exact attention
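As a rough illustration of what "exact attention" means here, below is a minimal sketch of calling FlashAttention's `flash_attn_func` (assuming the `flash_attn` package is installed with CUDA support; the tensor shapes are illustrative, following the documented `(batch, seqlen, nheads, headdim)` convention):

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
# FlashAttention requires fp16/bf16 inputs on a CUDA device.
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact (not approximate) attention, computed tile-by-tile in on-chip SRAM
# so the full seqlen x seqlen score matrix is never materialized in HBM.
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
```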
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
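A minimal sketch of that Python API (assuming a recent `tensorrt_llm` release that ships the high-level `LLM` class; the model name here is illustrative, not taken from this list):

```python
from tensorrt_llm import LLM, SamplingParams

# Builds or loads a TensorRT engine for the given Hugging Face model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, temperature=0.8)

# generate() takes a list of prompts and returns one output per prompt.
for output in llm.generate(["What is quantization?"], params):
    print(output.outputs[0].text)
```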
PyTorch native quantization and sparsity for training and inference
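For context, a minimal sketch of torchao's one-call post-training quantization (assuming a recent torchao release; `int8_weight_only` may be named differently in other versions):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

# Rewrites the weights of every Linear module to int8 in place;
# inference then runs with weight-only quantized matmuls.
quantize_(model, int8_weight_only())

x = torch.randn(4, 1024, device="cuda")
with torch.no_grad():
    y = model(x)
```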
StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation
[NeurIPS 2025] Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.
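Although the core API is C-based, the NVTX repository also ships Python bindings; below is a minimal sketch of range annotation with the `nvtx` package (assuming `pip install nvtx`; the range names are illustrative):

```python
import time
import nvtx

# Annotated ranges show up under these labels in profilers
# such as Nsight Systems.
with nvtx.annotate("preprocess", color="blue"):
    time.sleep(0.1)  # stand-in for real work

@nvtx.annotate("train_step", color="green")
def train_step():
    time.sleep(0.1)  # stand-in for real work

train_step()
```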