Releases: ollama/ollama
v0.12.6
What's Changed
- Ollama's app now supports searching when running DeepSeek-V3.1, Qwen3 and other models that support tool calling.
- Flash attention is now enabled by default for Gemma 3, improving performance and memory utilization
- Fixed issue where Ollama would hang while generating responses
- Fixed issue where `qwen3-coder` would act in raw mode when using `/api/generate` or `ollama run qwen3-coder <prompt>`
- Fixed `qwen3-embedding` providing invalid results
- Ollama will now evict models correctly when `num_gpu` is set (see the sketch after this list)
- Fixed issue where `tool_index` with a value of `0` would not be sent to the model
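As a hedged illustration (not part of the release notes): `num_gpu` controls how many layers are offloaded to the GPU and is passed per request through the API's documented `options` field, with `0` keeping the model entirely on CPU. The model name and prompt below are only examples.

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder",
  "prompt": "Write a haiku about GPUs.",
  "options": { "num_gpu": 0 }
}'
```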
Experimental Vulkan Support
Experimental support for Vulkan is now available when building locally from source. This enables additional AMD and Intel GPUs that are not currently supported by Ollama. To build locally, install the Vulkan SDK, set VULKAN_SDK in your environment, and follow the developer instructions. In a future release, Vulkan support will be included in the binary release as well. Please file issues if you run into any problems.
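A minimal sketch of that flow, assuming a Linux shell and the Vulkan SDK unpacked under `~/vulkan-sdk`; exact steps vary by platform, so treat the developer instructions as authoritative:

```shell
export VULKAN_SDK=$HOME/vulkan-sdk/x86_64   # assumed install location
cmake -B build && cmake --build build       # build the GPU runners per the developer docs
go run . serve                              # run the server from source
```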
New Contributors
- @yajianggroup made their first contribution in #12377
- @inforithmics made their first contribution in #11835
- @sbhavani made their first contribution in #12619
Full Changelog: v0.12.5...v0.12.6
v0.12.5
What's Changed
- Thinking models now support structured outputs when using the `/api/chat` API (see the sketch after this list)
- Ollama's app will now wait until Ollama is running to allow for a conversation to be started
- Fixed issue where `"think": false` would show an error instead of being silently ignored
- Fixed `deepseek-r1` output issues
- macOS 12 Monterey and macOS 13 Ventura are no longer supported
- AMD gfx900 and gfx906 (MI50, MI60, etc) GPUs are no longer supported via ROCm. We're working to support these GPUs via Vulkan in a future release.
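A hedged sketch of combining thinking with structured outputs: `think` and `format` (a JSON schema) are documented `/api/chat` request fields, while the model and schema here are illustrative.

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1",
  "messages": [{"role": "user", "content": "Name a country and its capital."}],
  "think": true,
  "format": {
    "type": "object",
    "properties": {
      "country": {"type": "string"},
      "capital": {"type": "string"}
    },
    "required": ["country", "capital"]
  },
  "stream": false
}'
```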
New Contributors
- @shengxinjing made their first contribution in #12415
Full Changelog: v0.12.4...v0.12.5-rc0
v0.12.4
What's Changed
- Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder
- Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs
- Fixed an issue where `keep_alive` in the API would accept different values for the `/api/chat` and `/api/generate` endpoints
- Fixed tool calling rendering with `qwen3-coder`
- More reliable and accurate VRAM detection
- `OLLAMA_FLASH_ATTENTION` can now be overridden to `0` for models that have flash attention enabled by default (see the sketch after this list)
- macOS 12 Monterey and macOS 13 Ventura are no longer supported
- Fixed a crash that occurred when templates were not correctly defined
- Fixed memory calculations on NVIDIA iGPUs
- AMD gfx900 and gfx906 (MI50, MI60, etc) GPUs are no longer supported via ROCm. We're working to support these GPUs via Vulkan in a future release.
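A hedged sketch of the override: `OLLAMA_FLASH_ATTENTION` is read by the server at startup, and `keep_alive` now accepts the same values on both endpoints. Model names and the duration below are illustrative.

```shell
# Disable flash attention even for models that enable it by default:
OLLAMA_FLASH_ATTENTION=0 ollama serve

# The same keep_alive value (e.g. "10m") works on both endpoints:
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3", "prompt": "hi", "keep_alive": "10m"}'
curl http://localhost:11434/api/chat \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "hi"}], "keep_alive": "10m"}'
```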
Full Changelog: v0.12.3...v0.12.4-rc3
v0.12.3
New models
- DeepSeek-V3.1-Terminus: a hybrid model that supports both thinking mode and non-thinking mode. It delivers more stable and reliable outputs across benchmarks compared to the previous version.
  Run on Ollama's cloud: `ollama run deepseek-v3.1:671b-cloud`
  Run locally (requires 500GB+ of VRAM): `ollama run deepseek-v3.1`
- Kimi-K2-Instruct-0905: the latest, most capable version of Kimi K2. It is a state-of-the-art mixture-of-experts (MoE) language model, featuring 32 billion activated parameters and a total of 1 trillion parameters.
  Run on Ollama's cloud: `ollama run kimi-k2:1t-cloud`
What's Changed
- Fixed issue where tool calls provided as stringified JSON would not be parsed correctly
- `ollama push` will now provide a URL to follow to sign in
- Fixed issues where `qwen3-coder` would output Unicode characters incorrectly
- Fixed issue where loading a model with `/load` would crash
Full Changelog: v0.12.2...v0.12.3
v0.12.2
Web search
A new web search API is now available in Ollama. Ollama provides a generous free tier of web searches for individuals to use, and higher rate limits are available via Ollama's cloud. This web search capability can augment models with the latest information from the web to reduce hallucinations and improve accuracy.
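A hedged sketch of calling it, assuming an API key from your Ollama account is exported as `OLLAMA_API_KEY`; the endpoint shape follows Ollama's web search documentation, and the query is illustrative:

```shell
curl https://ollama.com/api/web_search \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{"query": "what is ollama?"}'
```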
What's Changed
- Models with the Qwen 3 architecture, including MoE variants, now run in Ollama's new engine
- Fixed issue where built-in tools for gpt-oss were not being rendered correctly
- Support multi-regex pretokenizers in Ollama's new engine
- Ollama's new engine can now load tensors by matching a prefix or suffix
Full Changelog: v0.12.1...v0.12.2
v0.12.1
New models
- Qwen3 Embedding: state of the art open embedding model by the Qwen team
What's Changed
- Qwen3-Coder now supports tool calling (see the sketch after this list)
- Ollama's app will no longer show "connection lost" errors when connecting to cloud models
- Fixed issue where Gemma3 QAT models would not output correct tokens
- Fixed issue where `&` characters in Qwen3-Coder would not be parsed correctly during function calling
- Fixed issues where `ollama signin` would not work properly on Linux
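A hedged sketch of tool calling via the documented `tools` field of `/api/chat`; the `get_weather` function is hypothetical and only illustrates the request shape:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3-coder",
  "messages": [{"role": "user", "content": "What is the weather in Toronto?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```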
Full Changelog: v0.12.0...v0.12.1
v0.12.0
Cloud models
Cloud models are now available in preview, allowing you to run a group of larger models with fast, datacenter-grade hardware.
To run a cloud model, use:
ollama run qwen3-coder:480b-cloud
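As a hedged aside, once signed in, the same local API serves cloud models like any other model; the request below is illustrative:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3-coder:480b-cloud",
  "messages": [{"role": "user", "content": "Explain goroutines briefly."}]
}'
```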
What's Changed
- Models with the BERT architecture now run on Ollama's engine
- Models with the Qwen 3 architecture now run on Ollama's engine
- Fixed issue where older NVIDIA GPUs would not be detected if newer drivers were installed
- Fixed issue where models would not be imported correctly with `ollama create`
- Ollama will skip parsing the initial `<think>` if provided in the prompt for `/api/generate` by @rick-github
New Contributors
- @egyptianbman made their first contribution in #12300
- @russcoss made their first contribution in #12280
Full Changelog: v0.11.11...v0.12.0
v0.11.11
What's Changed
- Support for CUDA 13
- Improved memory usage when using gpt-oss in Ollama's app
- Better scrolling in Ollama's app when submitting long prompts
- Cmd +/- will now zoom and shrink text in Ollama's app
- Assistant messages can now be copied in Ollama's app
- Fixed error that would occur when attempting to import safetensors files by @rick-github in #12176
- Improved memory estimates for hybrid and recurrent models by @gabe-l-hart in #12186
- Fixed error that would occur when batch size was greater than context length
- Flash attention & KV cache quantization validation fixes by @jessegross in #12231
- Add `dimensions` field to embed requests by @mxyng in #12242 (see the sketch after this list)
- Enable new memory estimates in Ollama's new engine by default by @jessegross in #12252
- Ollama will no longer load split vision models in the Ollama engine by @jessegross in #12241
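A hedged sketch of the new field: `dimensions` requests embeddings of a given size from `/api/embed` where the model supports it; the model name and size below are illustrative.

```shell
curl http://localhost:11434/api/embed -d '{
  "model": "embeddinggemma",
  "input": "The quick brown fox",
  "dimensions": 256
}'
```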
New Contributors
- @KashyapTan made their first contribution in #12188
- @carbonatedWaterOrg made their first contribution in #12230
- @fengyuchuanshen made their first contribution in #12249
Full Changelog: v0.11.10...v0.11.11
v0.11.10
New models
- EmbeddingGemma: a new open embedding model that delivers best-in-class performance for its size
What's Changed
- Support for EmbeddingGemma
Full Changelog: v0.11.9...v0.11.10
v0.11.9
What's Changed
- Improved performance via overlapping GPU and CPU computations
- Fixed issues where an unrecognized AMD GPU would cause an error
- Reduced crashes due to unhandled errors in some Mac and Linux installations of Ollama
New Contributors
- @alpha-nerd-nomyo made their first contribution in #12129
- @pxwanglu made their first contribution in #12123
Full Changelog: v0.11.8...v0.11.9-rc0