Add Vulkan support to ollama #5059

Open
whyvl wants to merge 32 commits into main
Conversation

whyvl
@whyvl whyvl commented Jun 15, 2024

Edit: (2025/01/19)

It's been around 7 months and the ollama devs don't seem interested in merging this PR. I'll maintain my fork as a separate project from now on. If you have any issues, please raise them in the fork's repo so I can keep track of them.

This PR adds Vulkan support to ollama with a proper memory-monitoring implementation. It closes #2033 and replaces #2578, which did not implement proper memory monitoring.

Note that this implementation does not support GPUs that lack VkPhysicalDeviceMemoryBudgetPropertiesEXT support. This shouldn't be a problem, since on Linux the Mesa driver supports it for all Intel devices afaik.
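For reference, querying that extension boils down to chaining VkPhysicalDeviceMemoryBudgetPropertiesEXT into vkGetPhysicalDeviceMemoryProperties2. A minimal sketch in plain Vulkan C++ (illustrative only, not the exact code in this PR, and it assumes the driver advertises VK_EXT_memory_budget):

```cpp
// Minimal sketch (not this PR's code): read per-heap budget/usage via
// VK_EXT_memory_budget for a VkPhysicalDevice `dev`.
#include <vulkan/vulkan.h>
#include <cstdio>

void print_memory_budget(VkPhysicalDevice dev) {
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget{};
    budget.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;

    VkPhysicalDeviceMemoryProperties2 props{};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2;
    props.pNext = &budget;  // chain the budget struct so the driver fills it in

    vkGetPhysicalDeviceMemoryProperties2(dev, &props);

    for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; ++i) {
        std::printf("heap %u: budget %llu B, usage %llu B\n", i,
                    (unsigned long long)budget.heapBudget[i],
                    (unsigned long long)budget.heapUsage[i]);
    }
}
```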

The CAP_PERFMON capability is also needed for memory monitoring. When running ollama as a systemd service, this can be granted by adding AmbientCapabilities=CAP_PERFMON to the service unit, or you can just run ollama as root.
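For example, a systemd drop-in along these lines should work (the ollama.service unit name is an assumption; adjust to your setup), followed by a daemon-reload and a restart of the service:

```ini
# /etc/systemd/system/ollama.service.d/perfmon.conf  (assumed unit name)
[Service]
AmbientCapabilities=CAP_PERFMON
```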

Vulkan devices that are CPUs under the hood (e.g. llvmpipe) are also not supported. This is done on purpose, to avoid accidentally using CPUs for accelerated inference. Let me know if you think this behavior should be changed.
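The filtering essentially comes down to checking the device type that Vulkan reports. A minimal sketch of the idea (illustrative only, not this PR's exact code):

```cpp
// Minimal sketch (not this PR's code): skip Vulkan devices that are really
// CPU implementations such as llvmpipe.
#include <vulkan/vulkan.h>

bool is_usable_gpu(VkPhysicalDevice dev) {
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(dev, &props);
    // Software rasterizers like llvmpipe report VK_PHYSICAL_DEVICE_TYPE_CPU.
    return props.deviceType != VK_PHYSICAL_DEVICE_TYPE_CPU;
}
```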

I haven't tested this on Windows, nor have I implemented the logic for building ollama with Vulkan support there, because I don't use Windows. If someone can help me with this, that would be great.

I've tested this on my machine with an Intel Arc A770:

System:
  Host: rofl Kernel: 6.8.11 arch: x86_64 bits: 64 compiler: gcc v: 13.2.0
  Console: pty pts/2 Distro: NixOS 24.05 (Uakari)
CPU:
  Info: 8-core (4-mt/4-st) model: Intel 0000 bits: 64 type: MST AMCP arch: Raptor Lake rev: 2
    cache: L1: 704 KiB L2: 7 MiB L3: 12 MiB
  Speed (MHz): avg: 473 high: 1100 min/max: 400/4500:3400 cores: 1: 400 2: 400 3: 400 4: 576
    5: 400 6: 400 7: 400 8: 400 9: 400 10: 400 11: 1100 12: 400 bogomips: 59904
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3
Graphics:
  Device-1: Intel DG2 [Arc A770] vendor: Acer Incorporated ALI driver: i915 v: kernel
    arch: Gen-12.7 pcie: speed: 2.5 GT/s lanes: 1 ports: active: DP-1 empty: DP-2, DP-3, DP-4,
    HDMI-A-1, HDMI-A-2, HDMI-A-3 bus-ID: 03:00.0 chip-ID: 8086:56a0
  Display: server: No display server data found. Headless machine? tty: 98x63
  Monitor-1: DP-1 model: Daewoo HDMI res: 1024x600 dpi: 55 diag: 537mm (21.1")
  API: Vulkan v: 1.3.283 surfaces: N/A device: 0 type: discrete-gpu driver: N/A
    device-ID: 8086:56a0 device: 1 type: cpu driver: N/A device-ID: 10005:0000

@whyvl whyvl mentioned this pull request Jun 15, 2024
@rasodu
rasodu commented Jun 15, 2024

Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory

@whyvl
Author
whyvl commented Jun 15, 2024

Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory

https://github.com/ollama/ollama/blob/main/docs/development.md

@ddpasa
ddpasa commented Jun 15, 2024

I compiled and ran this on Linux (Arch, with an Intel iGPU). It seems to work correctly, with performance and output similar to my hacky version in #2578.

I think we can abandon my version in favour of this (it was never meant to be merged anyway).

gpu/gpu.go Outdated
index: i,
}

C.vk_check_vram(*vHandles.vulkan, C.int(i), &memInfo)

It would be nice to have a debug log here printing the amount of memory detected (especially with iGPUs this number can be useful).

Author

Doesn't ollama do it already? When I was debugging I saw something like Jun 15 20:25:32 rofl strace[403896]: time=2024-06-15T20:25:32.702+08:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Arc(tm) A770 Graphics (DG2)" total="15.9 GiB" available="14.3 GiB"


I think you're right, but I don't see that line exactly. Looks like a CAP_PERFMON thing, or I messed up the compilation:

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Plus Graphics (ICL GT2) | uma: 1 | fp16: 1 | warp size: 32

llama_new_context_with_model: Vulkan_Host output buffer size = 0.14 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 234.06 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB

Vulkan.time=2024-06-16T13:20:27.582+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib64/libvulkan.so.1.3.279 /usr/lib64/libcap.so.2.69=error !BADKEY="performance monitoring is not allowed. Please enable CAP_PERFMON or run as root to use Vulkan."

nvtop reveals iGPU being used as expected.

Author

Maybe run ollama as root? Or do setcap cap_perfmon=+ep /path/to/ollama


thanks

setcap didn't work for some reason; I still get CAP_PERFMON errors. But running with sudo gives:

time=2024-06-16T13:52:35.115+02:00 level=INFO source=gpu.go:355 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=oneapi compute="" driver=0.0 name="Intel(R) Iris(R) Plus Graphics" total="0 B" available="0 B"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Iris(R) Plus Graphics (ICL GT2)" total="11.4 GiB" available="8.4 GiB"

Author
@whyvl whyvl Jun 16, 2024

Vulkan is reporting that the device is a CPU. If it's an iGPU it should've been detected.

You mentioned the performance was similar to when you were testing your branch. Are you sure you are not using CPU inference the entire time? Can you compare the performance against a CPU runner like cpu_avx?

Author
@whyvl whyvl Jun 16, 2024

On a second read, never mind: everything is working as expected. Ollama detected two Vulkan devices. One is a CPU software implementation, which is skipped according to the error message, and the last line reports a Vulkan device recognized by ollama, which is the actual iGPU.

@ddpasa ddpasa Jun 16, 2024

Yes, that looks right. There is also a lot of oneAPI junk in the logs that confuses me, but it looks like Vulkan works as intended; I just have a CAP_PERFMON problem.

nvtop screenshot below: [image 2024-06-16_14-37]

I wonder why setcap does not work... Could it be that one of the shared libraries (like libcap or libvulkan) needs the setcap instead of the ollama binary?


OK, the CAP_PERFMON issue is likely due to something off in my system. It's trying to load the 32-bit library for some reason:

time=2024-06-16T14:58:06.326+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib/libvulkan.so.1.3.279 /usr/lib32/libcap.so.2.69=error !BADKEY="Unable to load /usr/lib32/libcap.so.2.69 library to query for Vulkan GPUs: /usr/lib32/libcap.so.2.69: wrong ELF class: ELFCLASS32"

Author

Loading the 32-bit library is expected and not related; it just gets skipped once it's clear it can't be loaded.

gpu/gpu_linux.go Outdated
}

var capLinuxGlobs = []string{
"/usr/lib/x86_64-linux-gnu/libcap.so*",

Adding * after /usr/lib also detects 32-bit libraries on the system. Not sure if you want this.


I suppose this depends on the OS? On Fedora I need to specify only lib64 for this to work, as lib is 32-bit.


Same comment as above regarding x86_64-specific paths: this doesn't work on aarch64, like my Raspberry Pi :)
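One way to cover these cases would be to broaden the glob list, roughly like this (illustrative only; the extra paths are assumptions, not what this PR currently ships):

```go
// Illustrative sketch: broader libcap glob list covering Debian/Ubuntu
// multiarch (x86_64 and aarch64), Fedora's lib64 split, and plain /usr/lib.
package gpu

var capLinuxGlobs = []string{
	"/usr/lib/x86_64-linux-gnu/libcap.so*",  // Debian/Ubuntu x86_64 multiarch
	"/usr/lib/aarch64-linux-gnu/libcap.so*", // Debian/Ubuntu aarch64 multiarch
	"/usr/lib64/libcap.so*",                 // Fedora/RHEL-style lib64
	"/usr/lib/libcap.so*",                   // Arch and others (may match 32-bit libs on some distros)
}
```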

@whyvl
Author
whyvl commented Jul 2, 2024

@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster with Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also emailed me noting that Vulkan support let them run ollama much faster on Polaris GPUs.

@genehand
genehand commented Jul 2, 2024

Working well here on an RX 5700 (the notorious gfx1010). Hoping I can use rocm again with 6.2, but this is a great alternative.

@Zxilly
Zxilly commented Jul 2, 2024

I'm trying to run this on Windows with an AMD GPU, and if I'm successful I'll share how I did it.

@whyvl
Author
whyvl commented Jul 2, 2024

I'm trying to run this on Windows with an AMD GPU, and if I'm successful I'll share how I did it.

Interesting. Since I haven't implemented Vulkan library loading on Windows, I had expected it wouldn't detect any Vulkan devices. Please do share how you did it.

@Zxilly
Zxilly commented Jul 2, 2024

I'll add the corresponding code, but I'm not that familiar with vulkan and this may take time.

@ddpasa
ddpasa commented Jul 3, 2024

@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster with Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also emailed me noting that Vulkan support let them run ollama much faster on Polaris GPUs.

Not just Arc: it also gives nice speedups on Intel iGPUs (Iris series).

@gioan777
gioan777 commented Jul 3, 2024

It works perfectly on Arch Linux with my RX 6700 XT as well, which doesn't have official ROCm support. I did encounter a couple of hiccups while setting it up, though they're probably distro-specific issues with my Arch Linux installation. I'll post the changes I made just for the record.

  • I had to do the following change to the code (git diff result):
diff --git a/llm/generate/gen_linux.sh b/llm/generate/gen_linux.sh
index 0e98e163..411e9e65 100755
--- a/llm/generate/gen_linux.sh
+++ b/llm/generate/gen_linux.sh
@@ -216,7 +216,6 @@ if [ -z "${CAP_ROOT}" ]; then
     CAP_ROOT=/usr/lib/
 fi
 
-if [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${VULKAN_ROOT}" ] && [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${CAP_ROOT}" ]; then
     echo "Vulkan and capabilities libraries detected - building dynamic Vulkan library"
     init_vars
 
@@ -232,7 +231,6 @@ if [ -z "${OLLAMA_SKIP_VULKAN_GENERATE}" -a -d "${VULKAN_ROOT}" ] && [ -z "${OLL
     cp "${VULKAN_ROOT}/libvulkan.so" "${BUILD_DIR}/bin/"
     cp "${CAP_ROOT}/libcap.so" "${BUILD_DIR}/bin/"
     compress
-fi
 
 if [ -z "${ONEAPI_ROOT}" ]; then
     # Try the default location in case it exists

otherwise Ollama wouldn't compile with Vulkan support.

  • Then I had to run the following command (./ollama is the final executable):
    sudo setcap 'cap_perfmon=ep' ./ollama
    otherwise ollama, on launch, would complain that it didn't have CAP_PERFMON and couldn't use Vulkan, and then reverted to CPU only. Running ./ollama as root also solved the issue, but I'm not comfortable running it as root.

@marksverdhei

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

@valonmamudi

...technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but I'm just saying that the decisions...

++

I would even be able to use the iGPU on my laptop without any special drivers or whatnot.

And please let's not forget the performance increase: with token inference speedups of up to 100%, it should actually be a no-brainer.

@kth8
kth8 commented Apr 29, 2025

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

I had written off an old ETH mining rig I built almost a decade ago as exactly that kind of e-waste, until I discovered the Vulkan backend after failing to get ROCm to work. That motivated me to create these repos to bring new life back into this hardware:
https://github.com/kth8/llama-server-vulkan
https://github.com/kth8/whisper-server-vulkan

@valonmamudi

I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact. Hundreds of thousands of older GPUs that are still viable and cost-effective are becoming e-waste, when they could otherwise be repurposed for the ever-growing trend of local LLMs. Adding Vulkan (or any more universal GPU backend) support means more than you might think, as I believe ollama is the most popular local inference technology for self-hosted LLMs in the world. I don't mean to be overly grandiose about this, but the decisions made in this project will have an actual impact on the value of existing electronics, and an environmental impact as a consequence.

I had written off an old ETH mining rig I built almost a decade ago as exactly that kind of e-waste, until I discovered the Vulkan backend after failing to get ROCm to work. That motivated me to create these repos to bring new life back into this hardware: https://github.com/kth8/llama-server-vulkan https://github.com/kth8/whisper-server-vulkan

The more I see such community efforts and solutions, the more I think Vulkan should have been added to Ollama a long time ago.

@ddpasa
ddpasa commented Apr 29, 2025

For those who want to use Vulkan, I recommend using the llama.cpp server directly. The Ollama team clearly does not care about this.

@truppelito

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

@ddpasa
ddpasa commented Apr 29, 2025

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

I have no idea. I have given up on Ollama. llama.cpp just works.

@mathstuf

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

Somewhere an ollama developer mentioned that supporting some other feature is more important than new backends.

llama.cpp just works.

One of the nice things (for me) about ollama is the easy model pulling and management; does llama.cpp have that?

@kth8
kth8 commented Apr 29, 2025

I still don't understand why so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?

Somewhere an ollama developer mentioned that supporting some other feature is more important than new backends.

llama.cpp just works.

One of the nice things (for me) about ollama is the easy model pulling and management; does llama.cpp have that?

No, llama.cpp is just an inference engine, so you need to manage the models yourself.

@khumarahn

No, llama.cpp is just an inference engine, so you need to manage the models yourself.

That's not quite correct: llama.cpp can pull models from Hugging Face.

@kth8
kth8 commented Apr 29, 2025

It can download models from HF, but to manage them (list what you have already downloaded, check their sizes, remove them, etc.) you need to perform those actions manually.

@khumarahn

It can download models from HF, but to manage them (list what you have already downloaded, check their sizes, remove them, etc.) you need to perform those actions manually.

ncdu ~/.cache/llama.cpp/

@marksverdhei

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

@jim3692
jim3692 commented May 1, 2025

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

You can use ramalama, which is a llama.cpp wrapper

@lirc572
lirc572 commented May 1, 2025

You can use ramalama, which is a llama.cpp wrapper

I really like ramalama's idea of using simple containerized wrappers over established frameworks. However, my experience with ramalama has not been smooth: for the past two months, every update has broken something from the previous version on my machine (the latest release even automatically removed all my running local models). That's understandable since it's still at version 0 and unstable, but for this reason it probably shouldn't be considered for anything other than basic testing at the moment. I look forward to seeing it grow, though.

@khumarahn

Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.

In my experience (CPU only, weak hardware), ollama was hardly usable at all. It uses a modified version of llama.cpp over which I don't have any control. On longer queries, ollama would frequently crash without clear error messages. Using llama.cpp directly is way easier: it just works, it gives much finer control over options, supports more models, does not require docker/podman, and shows better log messages. Managing downloaded models is trivial: they are stored in a folder with human-readable names.

Ollama may provide a better UX for standard use, when it works, but whenever you need something a little bit unusual, it sucks.

@engelmi
engelmi commented May 2, 2025

I really like ramalama's idea of using simple containerized wrappers over established frameworks. However, my experience with ramalama has not been smooth: for the past two months, every update has broken something from the previous version on my machine (the latest release even automatically removed all my running local models). That's understandable since it's still at version 0 and unstable, but for this reason it probably shouldn't be considered for anything other than basic testing at the moment. I look forward to seeing it grow, though.

Yes, we are currently operating in "move fast and break things" mode, so we frequently include breaking changes to further improve ramalama, such as the recent model store. Things should stabilize soon, though... stay tuned! :)

(the latest release even automatically removed all my running local models).

This is due to the migration from the old storage to the new one: the local models haven't been deleted, but moved to a different location. It can cause issues when upgrading ramalama while local models are running; re-running the models should work, though. I am sorry for the inconvenience this causes, and it is a one-time thing!

@kth8
kth8 commented May 2, 2025

@engelmi I did a bit of testing recently with RamaLama on a base M1 MBA running a 1B model. On bare metal the results were acceptable.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | pp512 | 1031.18 ± 3.76 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | tg128 | 56.99 ± 0.20 |

Running inside a GPU-accelerated VM using libkrun with the quay.io/ramalama/ramalama:0.7 image and Kompute, the prompt processing speed got destroyed.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | pp512 | 61.16 ± 0.04 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | tg128 | 20.05 ± 0.10 |

I saw there used to be a quay.io/ramalama/vulkan image, so I'm wondering why that backend is not available anymore in favor of Kompute? This is what I got using llama.cpp compiled with Vulkan.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | pp512 | 626.57 ± 0.51 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | tg128 | 23.61 ± 0.03 |

@ericcurtin
Contributor

@engelmi I did a bit of testing recently with RamaLama on a base M1 MBA running a 1B model. On bare metal the results were acceptable.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | pp512 | 1031.18 ± 3.76 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Metal,BLAS | 4 | tg128 | 56.99 ± 0.20 |

Running inside a GPU-accelerated VM using libkrun with the quay.io/ramalama/ramalama:0.7 image and Kompute, the prompt processing speed got destroyed.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | pp512 | 61.16 ± 0.04 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Kompute | 99 | tg128 | 20.05 ± 0.10 |

I saw there used to be a quay.io/ramalama/vulkan image, so I'm wondering why that backend is not available anymore in favor of Kompute? This is what I got using llama.cpp compiled with Vulkan.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | pp512 | 626.57 ± 0.51 |
| gemma3 1B Q4_0 | 680.82 MiB | 999.89 M | Vulkan | 99 | tg128 | 23.61 ± 0.03 |

This is known; we are eliminating Kompute soon and just focusing on the "Vulkan" implementation of Vulkan. The thing is, on macOS Vulkan has to go through a MoltenVK compatibility layer (which translates Vulkan to Metal), so it's hard to beat Metal directly (which is macOS-only). Luckily, in RamaLama we give you both options. @kpouget is the RamaLama perf guru.

@builker
builker commented Jun 3, 2025

I tested ollama 0.9.0-vulkan, this PR built with the latest patches in the EDIT - 20250530_0915, using my own Dockerfile.vulkan. The test host, also running Ubuntu 22.04, is an older PC with two GPUs: an Nvidia Quadro M620 (2 GB VRAM) and Intel HD Graphics 630. It reveals some memory issues.

Trying to run Vulkan on the Quadro M620: docker run -d --cap-add=PERFMON --gpus all -v dot_ollama:/root/.ollama -p 11434:11434 -e GGML_VK_VISIBLE_DEVICES=0 -e OLLAMA_DEBUG=1 --hostname=ollama --rm --name ollama local/ollama:vulkan-v0.9.0

root@ollama:/# vulkaninfo | grep "GPU id"
GPU id = 0 (Quadro M620)
GPU id = 1 (llvmpipe (LLVM 15.0.7, 256 bits))
root@ollama:/# ollama run qwen3:4b-q4_K_M --verbose
Error: llama runner process has terminated: exit status 2

From docker logs ollama:

load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  35 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: offloaded 21/37 layers to GPU
ggml_vulkan: Device memory allocation of size 635473408 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 635473408

Vanilla ollama 0.9.0 can offload 13 layers to CUDA0 by default, 15 layers manually with /set parameter num_gpu 15, and hits CUDA error: out of memory when attempting 16 layers.

I also tried to run the model on both GPUs together: docker run -d --cap-add=PERFMON --gpus all --device /dev/dri -v dot_ollama:/root/.ollama -p 11434:11434 -e GGML_VK_VISIBLE_DEVICES=0,1 -e OLLAMA_DEBUG=1 --hostname=ollama --rm --name ollama local/ollama:vulkan-v0.9.0. It doesn't offload the model to the two GPUs as I hoped, because in a separate run it tried to offload to Vulkan0 and Vulkan1, which were two instances of the Quadro M620. It only offloads to one GPU at a time, without estimating the number of layers.

root@ollama:/# vulkaninfo | grep "GPU id"
            GPU id = 0 (Quadro M620)
            GPU id = 1 (Intel(R) HD Graphics 630 (KBL GT2))
            GPU id = 2 (llvmpipe (LLVM 15.0.7, 256 bits))
root@ollama:/# ollama run qwen3:4b-q4_K_M --verbose
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) HD Graphics 630 (KBL GT2)) - 23981 MiB free
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
>>> who are you?
[completed slowly without errors]

>>> /set parameter main_gpu 1
>>> who are you?
Error: llama runner process has terminated: error loading model: unable to allocate Vulkan0 buffer

From docker logs ollama:

llama_model_load_from_file_impl: failed to load model
llama_model_load_from_file_impl: using device Vulkan0 (Quadro M620) - 2048 MiB free
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
ggml_vulkan: Device memory allocation of size 1068049408 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 1068049408
ggml_vulkan memory: ggml_backend_vk_buffer_free_buffer()
ggml_vulkan memory: Vulkan0: -1020.73 MiB device at 0x784aaa689be0. Total device: 0 B, total host: 0 B

It switches to the Quadro M620 from the HD Graphics 630, but attempts 37 layers, not even the over-estimated 21 from the first case. If I manually /set parameter num_gpu 15 after /set parameter main_gpu 1, it completes the generation faster, running partially on the Quadro M620 without errors.

@builker
builker commented Jun 5, 2025

Is there any coordination and priority among device requests via the environment variables CUDA_VISIBLE_DEVICES:-1 GGML_VK_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: ? Where is it implemented?

In envconfig/config.go:

var (
LLMLibrary = String("OLLAMA_LLM_LIBRARY")

    CudaVisibleDevices    = String("CUDA_VISIBLE_DEVICES")
    HipVisibleDevices     = String("HIP_VISIBLE_DEVICES")
    RocrVisibleDevices    = String("ROCR_VISIBLE_DEVICES")
    VkVisibleDevices      = String("GGML_VK_VISIBLE_DEVICES")
    GpuDeviceOrdinal      = String("GPU_DEVICE_ORDINAL")
    HsaOverrideGfxVersion = String("HSA_OVERRIDE_GFX_VERSION")

)

In ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();

// Emulate behavior of CUDA_VISIBLE_DEVICES for Vulkan
char * devices_env = getenv("GGML_VK_VISIBLE_DEVICES");
if (devices_env != nullptr) {
    std::string devices(devices_env);
    std::replace(devices.begin(), devices.end(), ',', ' ');

    std::stringstream ss(devices);
    size_t tmp;

@grinco
grinco commented Jun 6, 2025

Is there any coordination and priority among device requests via the environment variables CUDA_VISIBLE_DEVICES:-1 GGML_VK_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: ? Where is it implemented?

In envconfig/config.go:

var (
LLMLibrary = String("OLLAMA_LLM_LIBRARY")

    CudaVisibleDevices    = String("CUDA_VISIBLE_DEVICES")
    HipVisibleDevices     = String("HIP_VISIBLE_DEVICES")
    RocrVisibleDevices    = String("ROCR_VISIBLE_DEVICES")
    VkVisibleDevices      = String("GGML_VK_VISIBLE_DEVICES")
    GpuDeviceOrdinal      = String("GPU_DEVICE_ORDINAL")
    HsaOverrideGfxVersion = String("HSA_OVERRIDE_GFX_VERSION")

)

In ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();

// Emulate behavior of CUDA_VISIBLE_DEVICES for Vulkan
char * devices_env = getenv("GGML_VK_VISIBLE_DEVICES");
if (devices_env != nullptr) {
    std::string devices(devices_env);
    std::replace(devices.begin(), devices.end(), ',', ' ');

    std::stringstream ss(devices);
    size_t tmp;

See whyvl#22 and whyvl#7 (comment); some of it is already in #9650. I'll see if I have time in the next few weeks to attempt to merge the latest 0.9.0 code.

@ddpasa
ddpasa commented Jun 7, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

@Split7fire

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

What GPU are you working with?
I want to use an RX 5700 XT, but llama-cli --list-devices shows nothing, even though vulkaninfo lists the GPU correctly. I installed llama.cpp from brew on Aurora (Fedora Kinoite).

Any help appreciated.

@kth8
kth8 commented Jun 8, 2025

@Split7fire You need to compile llama.cpp yourself with the -D GGML_VULKAN=ON flag. llama.cpp from Homebrew has not been compiled with that flag: https://github.com/Homebrew/homebrew-core/blob/831392863f0b47c58cd1db68bc85db2ab48b9ed9/Formula/l/llama.cpp.rb
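For reference, a source build with the Vulkan backend enabled looks roughly like this (a sketch based on the usual llama.cpp CMake flow; it assumes the Vulkan SDK/headers and glslc are installed):

```sh
# Rough sketch: build llama.cpp with the Vulkan backend.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli --list-devices   # should now list your Vulkan GPU(s)
```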

@builker
builker commented Jun 9, 2025

@Split7fire I changed -DGGML_VULKAN=1 to -DGGML_VULKAN=ON in vulkan.Dockerfile. They should be equivalent, since CMake treats 1 and ON as the same boolean.

@linuxd3v
linuxd3v commented Jun 9, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp.
And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome.
Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

@ericcurtin
Contributor

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp. And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome. Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

This is an upcoming feature in ramalama with vulkan, stay tuned...

@kth8
kth8 commented Jun 9, 2025

I have given up on Vulkan Ollama. The Ollama devs clearly don't care. llama.cpp server works very well with vulkan, and I recommend just using it.

I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp. And then I have to maintain separate systemd files just to have different .gguf models loaded. Just cumbersome. Also, how do you update models? ollama at least has ollama pull built in; with llama.cpp I have to find the model and manually use aria2c or wget2 to download it.

Shame. I really could use Vulkan support in ollama; the AI Max 395 doesn't work well with ROCm yet.

llama-swap can hot-swap models with llama.cpp: https://github.com/mostlygeek/llama-swap

@lnicola
lnicola commented Jun 10, 2025

And then I have to maintain separate systemd files just to have different .gguf models loaded.

You might be able to use %I for the model in your unit, then e.g. systemctl start llama@ggml-org/gemma-3-1b-it-GGUF.
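A minimal template unit along those lines might look like this (an untested sketch; the binary path, flags, and port are assumptions):

```ini
# /etc/systemd/system/llama@.service  (illustrative sketch)
[Unit]
Description=llama.cpp server for %I

[Service]
# %I expands to the unescaped instance name; names containing '/' have to be
# escaped when starting the unit (see systemd-escape).
ExecStart=/usr/local/bin/llama-server -hf %I --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then something like systemctl start llama@<escaped-model-name> brings up one server per model.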

Successfully merging this pull request may close these issues.

Add Vulkan runner