Famous Vision Language Models and Their Architectures
A comprehensive collection and survey of vision-language model papers and their GitHub repositories. Continuously updated.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, and others.
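As a rough illustration of what such finetuning typically involves (not this repository's actual entry point), the sketch below attaches LoRA adapters to an HF-converted llava-1.5 checkpoint with transformers and peft; the model id, target modules, and hyperparameters are assumptions.

```python
# Hypothetical LoRA finetuning setup for llava-1.5; the actual codebase's
# scripts, flags, and defaults may differ.
import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # assumption: HF-converted checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# Attach LoRA adapters to the attention projections of the language model.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: common choice for LLaVA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, the wrapped model can be passed to a standard Trainer loop over image-text conversation data.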
Reinforcement Learning of Vision Language Models with Self Visual Perception Reward
ComfyUI-QwenVL is a custom node that integrates the Qwen-VL series, including Qwen2.5-VL and the latest Qwen3-VL, to enable advanced multimodal AI for text generation, image understanding, and video analysis.
Mark web pages for use with vision-language models
🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an open framework designed for seamless integration of vision and language tasks.
Qwen-VL base model for use with Autodistill.
Creates text from video and audio using Qwen-VL and Whisper.
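A rough sketch of that idea (not the repository's actual code): Whisper transcribes the speech track, and a Qwen-VL caption of sampled frames, for example via the Qwen2.5-VL sketch under the next entry, supplies the visual side; the two are then merged. The file name and model size are assumptions.

```python
# Audio half of a video-to-text pipeline using openai-whisper.
import whisper  # pip install openai-whisper; requires ffmpeg on PATH

def transcribe(video_path: str) -> str:
    # Whisper accepts video files directly and extracts the audio via ffmpeg.
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]

if __name__ == "__main__":
    speech = transcribe("talk.mp4")       # assumption: any local video with an audio track
    visuals = "<Qwen-VL frame captions>"  # placeholder for the vision half
    print(f"Speech: {speech}\nVisuals: {visuals}")
```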
Generate vivid, human-like captions for portrait images using the Qwen2.5-VL-7B model. Outputs dense descriptions covering emotion, posture, clothing, and environment.
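One way such captioning can be wired up, sketched with Qwen2.5-VL-7B-Instruct through Hugging Face transformers (a recent version with Qwen2.5-VL support is assumed); the image path and prompt wording are assumptions, and the repository's own prompt and pipeline may differ.

```python
# Hedged sketch of dense portrait captioning with Qwen2.5-VL-7B-Instruct.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "portrait.jpg"},  # assumption: a local portrait photo
        {"type": "text", "text": "Describe the person's emotion, posture, clothing, and environment in detail."},
    ],
}]

# Build the chat prompt and collect the image inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding so only the generated caption remains.
caption = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(caption)
```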
A computer vision system for automated analysis of index cards from a collection of coin forgeries using Qwen2.5-VL vision-language model. Developed for the imagines nummorum project.