Add Vulkan support to ollama #5059

Open
whyvl wants to merge 32 commits into main
Conversation

whyvl
@whyvl whyvl commented Jun 15, 2024

Edit: (2025/01/19)

It's been around 7 months and the ollama devs don't seem interested in merging this PR. I'll maintain my fork as a separate project from now on. If you have any issues, please raise them in the fork's repo so I can keep track of them.

This PR adds Vulkan support to ollama with a proper memory-monitoring implementation. It closes #2033 and replaces #2578, which did not implement proper memory monitoring.

Note that this implementation does not support GPUs that lack VkPhysicalDeviceMemoryBudgetPropertiesEXT support. This shouldn't be a problem, since on Linux the Mesa driver supports it for all Intel devices afaik.
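For reference, querying that extension boils down to chaining VkPhysicalDeviceMemoryBudgetPropertiesEXT into vkGetPhysicalDeviceMemoryProperties2. A minimal sketch in plain Vulkan C++ (illustrative only, not the exact code in this PR, and it assumes the driver advertises VK_EXT_memory_budget):

```cpp
// Minimal sketch (not this PR's code): read per-heap budget/usage via
// VK_EXT_memory_budget for a VkPhysicalDevice `dev`.
#include <vulkan/vulkan.h>
#include <cstdio>

void print_memory_budget(VkPhysicalDevice dev) {
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget{};
    budget.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;

    VkPhysicalDeviceMemoryProperties2 props{};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2;
    props.pNext = &budget;  // chain the budget struct so the driver fills it in

    vkGetPhysicalDeviceMemoryProperties2(dev, &props);

    for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; ++i) {
        std::printf("heap %u: budget %llu B, usage %llu B\n", i,
                    (unsigned long long)budget.heapBudget[i],
                    (unsigned long long)budget.heapUsage[i]);
    }
}
```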

The CAP_PERFMON capability is also needed for memory monitoring. When running ollama as a systemd service, this can be granted by adding AmbientCapabilities=CAP_PERFMON to the service unit, or you can just run ollama as root.
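For example, a systemd drop-in along these lines should work (the ollama.service unit name is an assumption; adjust to your setup), followed by a daemon-reload and a restart of the service:

```ini
# /etc/systemd/system/ollama.service.d/perfmon.conf  (assumed unit name)
[Service]
AmbientCapabilities=CAP_PERFMON
```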

Vulkan devices that are CPUs under the hood (e.g. llvmpipe) are also not supported. This is done on purpose, to avoid accidentally using CPUs for accelerated inference. Let me know if you think this behavior should be changed.
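The filtering essentially comes down to checking the device type that Vulkan reports. A minimal sketch of the idea (illustrative only, not this PR's exact code):

```cpp
// Minimal sketch (not this PR's code): skip Vulkan devices that are really
// CPU implementations such as llvmpipe.
#include <vulkan/vulkan.h>

bool is_usable_gpu(VkPhysicalDevice dev) {
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(dev, &props);
    // Software rasterizers like llvmpipe report VK_PHYSICAL_DEVICE_TYPE_CPU.
    return props.deviceType != VK_PHYSICAL_DEVICE_TYPE_CPU;
}
```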

I haven't tested this on Windows, nor have I implemented the logic for building ollama with Vulkan support there, because I don't use Windows. If someone can help me with this, that would be great.

I've tested this on my machine with an Intel Arc A770:

System:
  Host: rofl Kernel: 6.8.11 arch: x86_64 bits: 64 compiler: gcc v: 13.2.0
  Console: pty pts/2 Distro: NixOS 24.05 (Uakari)
CPU:
  Info: 8-core (4-mt/4-st) model: Intel 0000 bits: 64 type: MST AMCP arch: Raptor Lake rev: 2
    cache: L1: 704 KiB L2: 7 MiB L3: 12 MiB
  Speed (MHz): avg: 473 high: 1100 min/max: 400/4500:3400 cores: 1: 400 2: 400 3: 400 4: 576
    5: 400 6: 400 7: 400 8: 400 9: 400 10: 400 11: 1100 12: 400 bogomips: 59904
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3
Graphics:
  Device-1: Intel DG2 [Arc A770] vendor: Acer Incorporated ALI driver: i915 v: kernel
    arch: Gen-12.7 pcie: speed: 2.5 GT/s lanes: 1 ports: active: DP-1 empty: DP-2, DP-3, DP-4,
    HDMI-A-1, HDMI-A-2, HDMI-A-3 bus-ID: 03:00.0 chip-ID: 8086:56a0
  Display: server: No display server data found. Headless machine? tty: 98x63
  Monitor-1: DP-1 model: Daewoo HDMI res: 1024x600 dpi: 55 diag: 537mm (21.1")
  API: Vulkan v: 1.3.283 surfaces: N/A device: 0 type: discrete-gpu driver: N/A
    device-ID: 8086:56a0 device: 1 type: cpu driver: N/A device-ID: 10005:0000

@whyvl whyvl mentioned this pull request Jun 15, 2024
@rasodu
rasodu commented Jun 15, 2024

Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory

@whyvl
Author
whyvl commented Jun 15, 2024

Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory

https://github.com/ollama/ollama/blob/main/docs/development.md

@ddpasa
ddpasa commented Jun 15, 2024

I compiled and ran this on Linux (Arch, with an Intel iGPU). It seems to work correctly, with performance and output similar to my hacky version in #2578.

I think we can abandon my version in favour of this (it was never meant to be merged anyway).

gpu/gpu.go Outdated
index: i,
}

C.vk_check_vram(*vHandles.vulkan, C.int(i), &memInfo)

It would be nice to have a debug log here printing the amount of memory detected (especially with iGPUs this number can be useful).

Author

Doesn't ollama do it already? When I was debugging I saw something like Jun 15 20:25:32 rofl strace[403896]: time=2024-06-15T20:25:32.702+08:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Arc(tm) A770 Graphics (DG2)" total="15.9 GiB" available="14.3 GiB"


I think you're right, but I don't see that line exactly. Looks like a CAP_PERFMON thing, or I messed up the compilation:

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Plus Graphics (ICL GT2) | uma: 1 | fp16: 1 | warp size: 32

llama_new_context_with_model: Vulkan_Host output buffer size = 0.14 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 234.06 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB

Vulkan.time=2024-06-16T13:20:27.582+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib64/libvulkan.so.1.3.279 /usr/lib64/libcap.so.2.69=error !BADKEY="performance monitoring is not allowed. Please enable CAP_PERFMON or run as root to use Vulkan."

nvtop reveals iGPU being used as expected.

Author

Maybe run ollama as root? Or do setcap cap_perfmon=+ep /path/to/ollama


thanks

setcap didn't work for some reason; I still get CAP_PERFMON errors. But running with sudo gives:

time=2024-06-16T13:52:35.115+02:00 level=INFO source=gpu.go:355 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=oneapi compute="" driver=0.0 name="Intel(R) Iris(R) Plus Graphics" total="0 B" available="0 B"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Iris(R) Plus Graphics (ICL GT2)" total="11.4 GiB" available="8.4 GiB"

Author
@whyvl whyvl Jun 16, 2024

Vulkan is reporting that the device is a CPU. If it's an iGPU it should've been detected.

You mentioned the performance was similar to when you were testing your branch. Are you sure you are not using CPU inference the entire time? Can you compare the performance against a CPU runner like cpu_avx?

Author
@whyvl whyvl Jun 16, 2024

On a second read, never mind: everything is working as expected. Ollama detected two Vulkan devices. One is a CPU software implementation, which is skipped according to the error message, and the last line reports a Vulkan device recognized by ollama, which is the actual iGPU.

@ddpasa ddpasa Jun 16, 2024

Yes, that looks right. There is also a lot of oneAPI junk in the logs that confuses me, but it looks like Vulkan works as intended; I just have a CAP_PERFMON problem.

nvtop screenshot below: [image 2024-06-16_14-37]

I wonder why setcap does not work... Could it be that one of the shared libraries (like libcap or libvulkan) needs the setcap instead of the ollama binary?


OK, the CAP_PERFMON issue is likely due to something off in my system. It's trying to load the 32-bit library for some reason:

time=2024-06-16T14:58:06.326+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib/libvulkan.so.1.3.279 /usr/lib32/libcap.so.2.69=error !BADKEY="Unable to load /usr/lib32/libcap.so.2.69 library to query for Vulkan GPUs: /usr/lib32/libcap.so.2.69: wrong ELF class: ELFCLASS32"

Author

Loading the 32-bit library is expected and not related; it just gets skipped once it's clear it can't be loaded.

gpu/gpu_linux.go Outdated
}

var capLinuxGlobs = []string{
"/usr/lib/x86_64-linux-gnu/libcap.so*",

Adding * after /usr/lib also detects 32-bit libraries on the system. Not sure if you want this.


I suppose this depends on the OS? On Fedora I need to specify only lib64 for this to work, as lib is 32-bit.


Same comment as above regarding x86_64-specific paths: this doesn't work on aarch64, like my Raspberry Pi :)
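One way to cover these cases would be to broaden the glob list, roughly like this (illustrative only; the extra paths are assumptions, not what this PR currently ships):

```go
// Illustrative sketch: broader libcap glob list covering Debian/Ubuntu
// multiarch (x86_64 and aarch64), Fedora's lib64 split, and plain /usr/lib.
package gpu

var capLinuxGlobs = []string{
	"/usr/lib/x86_64-linux-gnu/libcap.so*",  // Debian/Ubuntu x86_64 multiarch
	"/usr/lib/aarch64-linux-gnu/libcap.so*", // Debian/Ubuntu aarch64 multiarch
	"/usr/lib64/libcap.so*",                 // Fedora/RHEL-style lib64
	"/usr/lib/libcap.so*",                   // Arch and others (may match 32-bit libs on some distros)
}
```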

@whyvl
Author
whyvl commented Jul 2, 2024

@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster with Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also emailed me noting that Vulkan support let them run ollama much faster on Polaris GPUs.

@genehand
genehand commented Jul 2, 2024

Working well here on an RX 5700 (the notorious gfx1010). Hoping I can use rocm again with 6.2, but this is a great alternative.

@Zxilly
Zxilly commented Jul 2, 2024

I'm trying to run this on Windows with an AMD GPU, and if I'm successful I'll share how I did it.

@whyvl
Author
whyvl commented Jul 2, 2024

I'm trying to run this on Windows with an AMD GPU, and if I'm successful I'll share how I did it.

Interesting. Since I haven't implemented Vulkan library loading on Windows, I had expected it wouldn't detect any Vulkan devices. Please do share how you did it.

@Zxilly
Zxilly commented Jul 2, 2024

I'll add the corresponding code, but I'm not that familiar with vulkan and this may take time.

@ddpasa
ddpasa commented Jul 3, 2024

@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster with Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also emailed me noting that Vulkan support let them run ollama much faster on Polaris GPUs.

Not just Arc: it also gives nice speedups on Intel iGPUs (Iris series).

@gioan777
gioan777 commented Jul 3, 2024

It works perfectly on Arch Linux with my RX 6700 XT as well, which doesn't have official ROCm support. I did encounter a couple of hiccups while setting it up, though they're probably distro-specific issues with my Arch Linux installation. I'll post the changes I made just for the record.

  • I had to do the following change to the code (git diff result):
diff --git a/llm/generate/gen_linux.sh b/llm/generate/gen_linux.sh
index 0e98e163..411e9e65 100755
--- a/llm/generate/gen_linux.sh
+++ b/llm/generate/gen_linux.sh
@@ -216,7 +216,6 @@ if [ -z "${CAP_ROOT}" ]; then
     CAP_ROOT=/usr/lib/
 fi
 
-if [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${VULKAN_ROOT}" ] && [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${CAP_ROOT}" ]; then
     echo "Vulkan and capabilities libraries detected - building dynamic Vulkan library"
     init_vars
 
@@ -232,7 +231,6 @@ if [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${VULKAN_ROOT}" ] && [ -z "${OLL
     cp "${VULKAN_ROOT}/libvulkan.so" "${BUILD_DIR}/bin/"
     cp "${CAP_ROOT}/libcap.so" "${BUILD_DIR}/bin/"
     compress
-fi
 
 if [ -z "${ONEAPI_ROOT}" ]; then
     # Try the default location in case it exists

otherwise Ollama wouldn't compile with Vulkan support.

  • Then I had to run the following command (./ollama is the final executable):
    sudo setcap 'cap_perfmon=ep' ./ollama
    otherwise ollama, on launch, would complain that it didn't have CAP_PERFMON and couldn't use Vulkan, and then reverted to CPU only. Running ./ollama as root also solved the issue, but I'm not comfortable running it as root.

@marksverdhei

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

@valonmamudi

...technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but I'm just saying that the decisions...

++

I would even be able to use the iGPU on my laptop without any special drivers or whatnot.

And please let's not forget the performance increase: with token inference speedups of up to 100%, it should actually be a no-brainer.

@kth8
kth8 commented Apr 29, 2025

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

I had written off an old ETH mining rig I built almost a decade ago as exactly that kind of e-waste, until I discovered the Vulkan backend after failing to get ROCm to work. That motivated me to create these repos to bring new life back into this hardware:
https://github.com/kth8/llama-server-vulkan
https://github.com/kth8/whisper-server-vulkan

@valonmamudi

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

I had written off an old ETH mining rig I built almost a decade ago as exactly that kind of e-waste, until I discovered the Vulkan backend after failing to get ROCm to work. That motivated me to create these repos to bring new life back into this hardware: https://github.com/kth8/llama-server-vulkan https://github.com/kth8/whisper-server-vulkan

The more I see such community efforts and solutions, the more I think Vulkan should have been added to Ollama a long time ago.

@ddpasa
ddpasa commented Apr 29, 2025

For those who want to use Vulkan, I recommend using the llama.cpp server directly. The Ollama team clearly does not care about this.

@truppelito

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

@ddpasa
ddpasa commented Apr 29, 2025

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

I have no idea. I have given up on Ollama. llama.cpp just works.

@mathstuf

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

Somewhere an ollama developer mentioned that supporting some other feature is more important than new backends.

llama.cpp just works.

One of the nice things (for me) about ollama is the easy model pulling and management; does llama.cpp have that?

@kth8
kth8 commented Apr 29, 2025

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

Somewhere an ollama developer mentioned that supporting some other feature is more important than new backends.

llama.cpp just works.

One of the nice things (for me) about ollama is the easy model pulling and management; does llama.cpp have that?

No, llama.cpp is just an inference engine, so you need to manage the models yourself.

@khumarahn

No, llama.cpp is just an inference engine, so you need to manage the models yourself.

That's not quite correct: llama.cpp can pull models from Hugging Face.

@kth8
kth8 commented Apr 29, 2025

It can download models from HF, but to manage them (list what you have already downloaded, check their sizes, remove them, etc.) you need to perform those actions manually.

@khumarahn

It can download models from HF, but to manage them (list what you have already downloaded, check their sizes, remove them, etc.) you need to perform those actions manually.

ncdu ~/.cache/llama.cpp/

@marksverdhei

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

@jim3692
jim3692 commented May 1, 2025

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

You can use ramalama, which is a llama.cpp wrapper

@lirc572
lirc572 commented May 1, 2025

You can use ramalama, which is a llama.cpp wrapper

I really like ramalama's idea of using simple containerized wrappers over established frameworks. However, my experience with ramalama has not been smooth: for the past two months, every update has broken something from the previous version on my machine (the latest release even automatically removed all my running local models). That's understandable since it's still at version 0 and unstable, but for this reason it probably shouldn't be considered for anything other than basic testing at the moment. I look forward to seeing it grow, though.

@khumarahn

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

In my experience (CPU only, weak hardware), ollama was hardly usable at all. It uses a modified version of llama.cpp over which I don't have any control. On longer queries, ollama would frequently crash without clear error messages. Using llama.cpp directly is way easier: it just works, it gives much finer control over options, supports more models, does not require docker/podman, and shows better log messages. Managing downloaded models is trivial: they are stored in a folder with human-readable names.

Ollama may provide a better UX for standard use, when it works, but whenever you need something a little bit unusual, it sucks.

@engelmi
engelmi commented May 2, 2025

I really like ramalama's idea of using simple containerized wrappers over established frameworks. However, my experience with ramalama has not been smooth: for the past two months, every update has broken something from the previous version on my machine (the latest release even automatically removed all my running local models). That's understandable since it's still at version 0 and unstable, but for this reason it probably shouldn't be considered for anything other than basic testing at the moment. I look forward to seeing it grow, though.

Yes, we are currently operating in "move fast and break things" mode, so we frequently include breaking changes to further improve ramalama, such as the recent model store. Things should stabilize soon, though... stay tuned! :)

(the latest release even automatically removed all my running local models).

This is due to the migration from the old storage to the new one: the local models haven't been deleted, but moved to a different location. It can cause issues when upgrading ramalama while local models are running; re-running the models should work, though. I am sorry for the inconvenience this causes, and it is a one-time thing!

@kth8
kth8 commented May 2, 2025

@engelmi I did a bit of testing recently with RamaLama on a base M1 MBA running a 1B model. On bare metal the results were acceptable.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | pp512 | 1031.18 ± 3.76 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | tg128 | 56.99 ± 0.20 |

Running inside a GPU-accelerated VM using libkrun with the quay.io/ramalama/ramalama:0.7 image and Kompute, the prompt processing speed got destroyed.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | pp512 | 61.16 ± 0.04 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | tg128 | 20.05 ± 0.10 |

I saw there used to be a quay.io/ramalama/vulkan image, so I'm wondering why that backend is not available anymore in favor of Kompute? This is what I got using llama.cpp compiled with Vulkan.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | pp512 | 626.57 ± 0.51 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | tg128 | 23.61 ± 0.03 |

@ericcurtin
Contributor

@engelmi I did a bit of testing recently with RamaLama on a base M1 MBA running a 1B model. On bare metal the results were acceptable.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | pp512 | 1031.18 ± 3.76 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | tg128 | 56.99 ± 0.20 |

Running inside a GPU-accelerated VM using libkrun with the quay.io/ramalama/ramalama:0.7 image and Kompute, the prompt processing speed got destroyed.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | pp512 | 61.16 ± 0.04 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | tg128 | 20.05 ± 0.10 |

I saw there used to be a quay.io/ramalama/vulkan image, so I'm wondering why that backend is not available anymore in favor of Kompute? This is what I got using llama.cpp compiled with Vulkan.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | pp512 | 626.57 ± 0.51 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | tg128 | 23.61 ± 0.03 |

This is known; we are eliminating Kompute soon and just focusing on the "Vulkan" implementation of Vulkan. The thing is, on macOS Vulkan has to go through a MoltenVK compatibility layer (which translates Vulkan to Metal), so it's hard to beat Metal directly (which is macOS-only). Luckily, in RamaLama we give you both options. @kpouget is the RamaLama perf guru.

@builker
builker commented Jun 3, 2025

I tested ollama 0.9.0-vulkan, this PR built with the latest patches in the EDIT - 20250530_0915, using my own Dockerfile.vulkan. The test host, also running Ubuntu 22.04, is an older PC with two GPUs: an Nvidia Quadro M620 (2 GB VRAM) and Intel HD Graphics 630. It reveals some memory issues.

Trying to run Vulkan on the Quadro M620: docker run -d --cap-add=PERFMON --gpus all -v dot_ollama:/root/.ollama -p 11434:11434 -e GGML_VK_VISIBLE_DEVICES=0 -e OLLAMA_DEBUG=1 --hostname=ollama --rm --name ollama local/ollama:vulkan-v0.9.0

root@ollama:/# vulkaninfo | grep "GPU id"
GPU id = 0 (Quadro M620)
GPU id = 1 (llvmpipe (LLVM 15.0.7, 256 bits))
root@ollama:/# ollama run qwen3:4b-q4_K_M --verbose
Error: llama runner process has terminated: exit status 2

From docker logs ollama:

load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  35 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: offloaded 21/37 layers to GPU
ggml_vulkan: Device memory allocation of size 635473408 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 635473408

Vanilla ollama 0.9.0 can offload 13 layers to CUDA0 by default, 15 layers manually with /set parameter num_gpu 15, and hits CUDA error: out of memory when attempting 16 layers.

I also tried to run the model on both GPUs together: docker run -d --cap-add=PERFMON --gpus all --device /dev/dri -v dot_ollama:/root/.ollama -p 11434:11434 -e GGML_VK_VISIBLE_DEVICES=0,1 -e OLLAMA_DEBUG=1 --hostname=ollama --rm --name ollama local/ollama:vulkan-v0.9.0. It doesn't offload the model to the two GPUs as I hoped, because in a separate run it tried to offload to Vulkan0 and Vulkan1, which were two instances of the Quadro M620. It only offloads to one GPU at a time, without estimating the number of layers.

root@ollama:/# vulkaninfo | grep "GPU id"
            GPU id = 0 (Quadro M620)
            GPU id = 1 (Intel(R) HD Graphics 630 (KBL GT2))
            GPU id = 2 (llvmpipe (LLVM 15.0.7, 256 bits))
root@ollama:/# ollama run qwen3:4b-q4_K_M --verbose
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) HD Graphics 630 (KBL GT2)) - 23981 MiB free
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
>>> who are you?
[completed slowly without errors]

>>> /set parameter main_gpu 1
>>> who are you?
Error: llama runner process has terminated: error loading model: unable to allocate Vulkan0 buffer

From docker logs ollama:

llama_model_load_from_file_impl: failed to load model
llama_model_load_from_file_impl: using device Vulkan0 (Quadro M620) - 2048 MiB free
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
ggml_vulkan: Device memory allocation of size 1068049408 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 1068049408
ggml_vulkan memory: ggml_backend_vk_buffer_free_buffer()
ggml_vulkan memory: Vulkan0: -1020.73 MiB device at 0x784aaa689be0. Total device: 0 B, total host: 0 B

It switches to the Quadro M620 from the HD Graphics 630, but attempts 37 layers, not even the over-estimated 21 from the first case. If I manually /set parameter num_gpu 15 after /set parameter main_gpu 1, it completes the generation faster, running partially on the Quadro M620 without errors.

@builker
builker commented Jun 5, 2025

Is there any coordination and priority among device requests via the environment variables CUDA_VISIBLE_DEVICES:-1 GGML_VK_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: ? Where is it implemented?

In envconfig/config.go:

var (
LLMLibrary = String("OLLAMA_LLM_LIBRARY")

    CudaVisibleDevices    = String("CUDA_VISIBLE_DEVICES")
    HipVisibleDevices     = String("HIP_VISIBLE_DEVICES")
    RocrVisibleDevices    = String("ROCR_VISIBLE_DEVICES")
    VkVisibleDevices      = String("GGML_VK_VISIBLE_DEVICES")
    GpuDeviceOrdinal      = String("GPU_DEVICE_ORDINAL")
    HsaOverrideGfxVersion = String("HSA_OVERRIDE_GFX_VERSION")

)

In ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();

// Emulate behavior of CUDA_VISIBLE_DEVICES for Vulkan
char * devices_env = getenv("GGML_VK_VISIBLE_DEVICES");
if (devices_env != nullptr) {
    std::string devices(devices_env);
    std::replace(devices.begin(), devices.end(), ',', ' ');

    std::stringstream ss(devices);
    size_t tmp;

@grinco
grinco commented Jun 6, 2025

Is there any coordination and priority among device requests via the environment variables CUDA_VISIBLE_DEVICES:-1 GGML_VK_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: ? Where is it implemented?

In envconfig/config.go:

var (
LLMLibrary = String("OLLAMA_LLM_LIBRARY")

    CudaVisibleDevices    = String("CUDA_VISIBLE_DEVICES")
    HipVisibleDevices     = String("HIP_VISIBLE_DEVICES")
    RocrVisibleDevices    = String("ROCR_VISIBLE_DEVICES")
    VkVisibleDevices      = String("GGML_VK_VISIBLE_DEVICES")
    GpuDeviceOrdinal      = String("GPU_DEVICE_ORDINAL")
    HsaOverrideGfxVersion = String("HSA_OVERRIDE_GFX_VERSION")

)

In ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();

// Emulate behavior of CUDA_VISIBLE_DEVICES for Vulkan
char * devices_env = getenv("GGML_VK_VISIBLE_DEVICES");
if (devices_env != nullptr) {
    std::string devices(devices_env);
    std::replace(devices.begin(), devices.end(), ',', ' ');

    std::stringstream ss(devices);
    size_t tmp;

See whyvl#22 and whyvl#7 (comment); some of it is already in #9650. I'll see if I have time in the next few weeks to attempt to merge the latest 0.9.0 code.

@ddpasa
ddpasa commented Jun 7, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

@Split7fire

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

What GPU are you working with?
I want to use an RX 5700 XT, but llama-cli --list-devices shows nothing, even though vulkaninfo lists the GPU correctly. I installed llama.cpp from brew on Aurora (Fedora Kinoite).

Any help appreciated.

@kth8
kth8 commented Jun 8, 2025

@Split7fire You need to compile llama.cpp yourself with the -D GGML_VULKAN=ON flag. llama.cpp from Homebrew has not been compiled with that flag: https://github.com/Homebrew/homebrew-core/blob/831392863f0b47c58cd1db68bc85db2ab48b9ed9/Formula/l/llama.cpp.rb
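For reference, a source build with the Vulkan backend enabled looks roughly like this (a sketch based on the usual llama.cpp CMake flow; it assumes the Vulkan SDK/headers and glslc are installed):

```sh
# Rough sketch: build llama.cpp with the Vulkan backend.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli --list-devices   # should now list your Vulkan GPU(s)
```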

@builker
builker commented Jun 9, 2025

@Split7fire I changed -DGGML_VULKAN=1 to -DGGML_VULKAN=ON in vulkan.Dockerfile. They should be equivalent, since CMake treats 1 and ON as the same boolean.

@linuxd3v
linuxd3v commented Jun 9, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp.
And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome.
Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

@ericcurtin
Contributor

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp. And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome. Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

This is an upcoming feature in ramalama with vulkan, stay tuned...

@kth8
kth8 commented Jun 9, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp. And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome. Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

llama-swap can hot-swap models with llama.cpp: https://github.com/mostlygeek/llama-swap

@lnicola
lnicola commented Jun 10, 2025

And then I have to maintain separate systemd files just to have different .gguf models loaded.

You might be able to use %I for the model in your unit, then e.g. systemctl start llama@ggml-org/gemma-3-1b-it-GGUF.
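A minimal template unit along those lines might look like this (an untested sketch; the binary path, flags, and port are assumptions):

```ini
# /etc/systemd/system/llama@.service  (illustrative sketch)
[Unit]
Description=llama.cpp server for %I

[Service]
# %I expands to the unescaped instance name; names containing '/' have to be
# escaped when starting the unit (see systemd-escape).
ExecStart=/usr/local/bin/llama-server -hf %I --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then something like systemctl start llama@<escaped-model-name> brings up one server per model.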

Successfully merging this pull request may close these issues.

Add Vulkan runner