Add Vulkan support to ollama #5059
Conversation
Are there any available instructions or guides that outline the steps to install Ollama from its source code on a Windows operating system? I have a Windows 10 machine equipped with an Arc A770 GPU with 8GB of memory.
https://github.com/ollama/ollama/blob/main/docs/development.md
I compiled and ran this on Linux (Arch, with an Intel iGPU). It seems to work correctly, with performance and output similar to my hacky version in #2578. I think we can abandon my version in favour of this (it was never meant to be merged anyway).
gpu/gpu.go
Outdated
index: i,
}

C.vk_check_vram(*vHandles.vulkan, C.int(i), &memInfo)
it could be nice to have a debugging log here printing the amount of memory detected. (especially with iGPUs this number can be useful)
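Something like the following could do it, right after the vk_check_vram call. This is only a minimal sketch: it assumes log/slog is imported and that memInfo exposes total and free byte counts, which may not match the exact field names in this PR.

// Sketch only: log how much VRAM the Vulkan backend reported for device i.
// The memInfo field names (total/free) are assumptions for illustration.
slog.Debug("vulkan device memory",
    "device", i,
    "total_bytes", uint64(memInfo.total),
    "free_bytes", uint64(memInfo.free),
)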
Doesn't ollama do it already? When I was debugging I saw something like Jun 15 20:25:32 rofl strace[403896]: time=2024-06-15T20:25:32.702+08:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Arc(tm) A770 Graphics (DG2)" total="15.9 GiB" available="14.3 GiB"
I think you're right, but I don't see that line exactly. Looks like a CAP_PERFMON thing, or I messed up compilation:
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Plus Graphics (ICL GT2) | uma: 1 | fp16: 1 | warp size: 32
llama_new_context_with_model: Vulkan_Host output buffer size = 0.14 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 234.06 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
Vulkan.time=2024-06-16T13:20:27.582+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib64/libvulkan.so.1.3.279 /usr/lib64/libcap.so.2.69=error !BADKEY="performance monitoring is not allowed. Please enable CAP_PERFMON or run as root to use Vulkan."
nvtop reveals iGPU being used as expected.
Maybe run ollama as root? Or do setcap cap_perfmon=+ep /path/to/ollama
Thanks. setcap didn't work for some reason; I still get CAP_PERFMON errors. But running with sudo gives:
time=2024-06-16T13:52:35.115+02:00 level=INFO source=gpu.go:355 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=oneapi compute="" driver=0.0 name="Intel(R) Iris(R) Plus Graphics" total="0 B" available="0 B"
time=2024-06-16T13:52:35.130+02:00 level=INFO source=types.go:102 msg="inference compute" id=0 library=vulkan compute=1.3 driver=1.3 name="Intel(R) Iris(R) Plus Graphics (ICL GT2)" total="11.4 GiB" available="8.4 GiB"
Vulkan is reporting that the device is a CPU. If it's an iGPU it should've been detected.
You mentioned the performance was similar to when you were testing your branch. Are you sure you are not using CPU inference the entire time? Can you compare the performance against a CPU runner like cpu_avx?
On a second read, never mind. It seems like everything is working as expected. Ollama detected two Vulkan devices: one is a CPU software implementation, which is skipped according to the error message, and the last line reports a Vulkan device that ollama recognizes, which is the actual iGPU.
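For context, the skip amounts to checking the reported Vulkan device type and refusing CPU implementations such as llvmpipe. A rough sketch of that check is below; the props variable is hypothetical, and the real check in this PR may live on the C side of the cgo bridge rather than in Go.

// Sketch only: reject Vulkan devices that are CPU software implementations
// (e.g. llvmpipe). VK_PHYSICAL_DEVICE_TYPE_CPU is the standard Vulkan enum value.
if props.deviceType == C.VK_PHYSICAL_DEVICE_TYPE_CPU {
    return errors.New("device is a CPU")
}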
Yes, that looks right. There is also a lot of oneAPI junk in the logs that confuses me, but it looks like Vulkan works as intended; I just have a CAP_PERFMON problem.
nvtop screenshot below:
I wonder why setcap does not work... Could it be that one of the shared libraries (like libcap or libvulkan) needs the setcap instead of the ollama binary?
OK, the CAP_PERFMON issue is likely due to something off in my system. It's trying to load the 32-bit library for some reason:
time=2024-06-16T14:58:06.326+02:00 level=DEBUG source=gpu.go:649 msg="Unable to load vulkan" library=/usr/lib/libvulkan.so.1.3.279 /usr/lib32/libcap.so.2.69=error !BADKEY="Unable to load /usr/lib32/libcap.so.2.69 library to query for Vulkan GPUs: /usr/lib32/libcap.so.2.69: wrong ELF class: ELFCLASS32"
Loading the 32-bit library is expected; it's unrelated, because ollama just skips a library once it realizes it can't load it.
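The pattern is roughly the following. This is a sketch only, with hypothetical names (candidateLibs, loadVulkanLib, handles are not the identifiers used in this PR):

// Sketch: try each candidate library path and skip the ones that fail to load,
// e.g. a 32-bit libcap/libvulkan on a 64-bit system (wrong ELF class).
for _, libPath := range candidateLibs {
    handle, err := loadVulkanLib(libPath)
    if err != nil {
        slog.Debug("unable to load vulkan", "library", libPath, "error", err)
        continue
    }
    handles = append(handles, handle)
}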
gpu/gpu_linux.go
Outdated
}
var capLinuxGlobs = []string{
"/usr/lib/x86_64-linux-gnu/libcap.so*",
Adding * after /usr/lib also detects 32-bit libraries on the system. Not sure if you want this.
I suppose this depends on the OS? On Fedora I have to specify only lib64 for this to work, since lib is 32-bit.
Same comment as above regarding x86_64-specific usage: this doesn't work on aarch64, like my Raspberry Pi :)
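If broader coverage is wanted, the glob list could be extended along these lines. Illustrative only: the extra entries are my assumption rather than what this PR ships, and which paths belong here depends on the distros being targeted.

// Illustrative only: candidate libcap locations across common distro layouts.
var capLinuxGlobs = []string{
    "/usr/lib/x86_64-linux-gnu/libcap.so*",  // Debian/Ubuntu multiarch on x86_64
    "/usr/lib/aarch64-linux-gnu/libcap.so*", // Debian/Ubuntu multiarch on aarch64 (e.g. Raspberry Pi)
    "/usr/lib64/libcap.so*",                 // Fedora-style lib64 layout
}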
@dhiltgen mind reviewing this? I'd imagine this would be pretty useful for plenty of people. Intel Arc GPUs perform faster on Vulkan than with oneAPI, and oneAPI is still not packaged on NixOS. Someone has also emailed me noting how Vulkan support has let them run ollama on Polaris GPUs much faster.
Working well here on an RX 5700 (the notorious gfx1010). Hoping I can use ROCm again with 6.2, but this is a great alternative.
I managed to run this on Windows with an AMD GPU, and if it works out I'll share how I did it.
Interesting. I had expected that, because I hadn't implemented Vulkan library loading on Windows, it wouldn't have detected any Vulkan devices. Please do share how you did it.
I'll add the corresponding code, but I'm not that familiar with Vulkan and this may take time.
Not just Arc: it also gives nice speedups on Intel iGPUs (Iris series).
It works perfectly on Arch Linux with my RX 6700XT as well, which doesn't have official ROCm support. I did encounter a couple of hiccups while setting it up, though they're probably distro-specific issues with my Arch Linux installation. I'll post the changes I made just for the record;
otherwise Ollama wouldn't compile with Vulkan support.
I'm going to toss a couple of cents into this by emphasizing: adding Vulkan support to ollama will have an environmental impact.
++ I would even be able to use the iGPU on my laptop without any special drivers or whatnot. And let's not forget the performance increase; with a token inference speedup of up to 100%, it should really be a no-brainer.
I had written off an old ETH mining rig I built almost a decade ago as e-waste until I discovered the Vulkan backend after failing to get ROCm to work, which motivated me to create these repos to bring new life back into this hardware:
The more I see such community efforts and solutions, the more I think Vulkan should have been added to Ollama a long time ago.
For those who want to use Vulkan, I recommend using the llama.cpp server directly. The Ollama team clearly does not care about this.
I still don't understand why there is so much resistance to (what seems to me) such an obviously good thing. Has the Ollama team provided some valid reasoning for this?
I have no idea. I have given up on Ollama. llama.cpp just works.
Somewhere an ollama developer mentioned that supporting some other feature is more important than new backends.
One of the nice things (for me) about
No, llama.cpp is just an inference engine, so you need to manage the models yourself.
That's not quite correct. llama.cpp can pull models from Hugging Face.
It can download models from HF, but to manage them (list what you have already downloaded, check their sizes, remove them, etc.) you need to perform those actions manually.
Ollama provides a better UX, which is likely what makes it the most popular inference server. People say to just use llama.cpp, but that definitely raises the bar for people who are just getting into local LLMs, even if dabbling in the .cache dir might seem trivial to you. UX/DX matters, which is also why Vulkan support here matters.
You can use ramalama, which is a llama.cpp wrapper.
I really like ramalama's idea of using simple containerized wrappers over established frameworks. However, my experience with ramalama has not been smooth: every update over the past two months has broken something from the previous version on my machine (the latest release even automatically removed all my running local models). That's understandable, as it is still in version 0 and unstable, but for this reason it probably should not be considered for any purpose other than basic testing at the moment. Looking forward to seeing it grow, though.
In my experience (CPU only, weak hardware), ollama was hardly usable at all. It uses a modified version of llama.cpp over which I don't have any control, and on longer queries ollama would frequently crash without clear error messages. Using llama.cpp directly is way easier: it just works, it gives much finer control over options, supports more models, does not require docker/podman, and shows better log messages. Managing downloaded models is trivial: they are stored in a folder with human-readable names. Ollama may provide a better UX for standard use, when it works, but whenever you need something a little bit unusual, it sucks.
Yes, we are currently operating in a "move fast and break things" mode, so we frequently include breaking changes to further improve ramalama, such as the recent model store. Things should stabilize soon, though... stay tuned! :)
This is due to the migration from the old storage to the new one: the local models haven't been deleted, just moved to a different location. It might cause issues to upgrade ramalama while local models are running, but re-running the models should work. I am sorry for the inconvenience this causes; it is a one-time thing!
@engelmi I did a bit of testing recently with RamaLama on a base M1 MBA running a 1B model. On bare metal the results were acceptable.
Running inside a GPU accelerated VM using libkrun with
I saw there used to be a
This is known; we are eliminating Kompute soon and just focusing on the "Vulkan" implementation of Vulkan. The thing is, Vulkan has to go through a MoltenVK compatibility layer (which translates Vulkan to Metal), so it's hard to beat Metal directly (which is a macOS-only thing). Luckily, in RamaLama we give you both options. @kpouget is the RamaLama perf guru.
I tested ollama 0.9.0-vulkan, this PR built with the latest patches in the EDIT - 20250530_0915, using my own Dockerfile.vulkan. The test host, also running Ubuntu 22.04, is an older PC with two GPUs: an Nvidia Quadro M620 (2GB VRAM) and Intel HD Graphics 630. It reveals some memory issues. Trying to run Vulkan on the Quadro M620:
From docker logs ollama:
The vanilla ollama 0.9.0 can offload 13 layers to CUDA0 by default, or 15 layers manually. I also tried to run the model on both GPUs together:
From docker logs ollama:
It switches from HD Graphics 630 to the Quadro M620, but attempts 37 layers, without even the over-estimated 21 from the first case. If I manually
Is there any coordination and priority among device requests via the environment variables? In envconfig/config.go:
In ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:
See whyvl#22 and whyvl#7 (comment); some of it is already in #9650. I'll see if I have time in the next few weeks to attempt to merge the latest 0.9.0 code.
I have given up on Vulkan Ollama. The Ollama devs clearly don't care. The llama.cpp server works very well with Vulkan, and I recommend just using it.
What GPU are you working with? Any help appreciated.
@Split7fire You need to compile llama.cpp yourself with the
@Split7fire I change
I use self-compiled llama.cpp with Vulkan; that works, but switching models is such a chore on llama.cpp. Shame. I could really use Vulkan support in ollama. The AI Max 395 doesn't work well with ROCm yet.
This is an upcoming feature in ramalama with Vulkan, stay tuned...
You might be able to use
Edit: (2025/01/19)
It's been around 7 months and the ollama devs don't seem to be interested in merging this PR. I'll maintain this fork as a separate project from now on. If you have any issues, please raise them in the fork's repo so I can keep track of them.
This PR adds Vulkan support to ollama with a proper memory monitoring implementation. It closes #2033 and replaces #2578, which does not implement proper memory monitoring.
Note that this implementation does not support GPUs without VkPhysicalDeviceMemoryBudgetPropertiesEXT support. This shouldn't be a problem, since on Linux the Mesa driver supports it for all Intel devices afaik.
The CAP_PERFMON capability is also needed for memory monitoring. It can be granted by specifically enabling CAP_PERFMON when running ollama as a systemd service, by adding AmbientCapabilities=CAP_PERFMON to the service, or by just running ollama as root.
Vulkan devices that are CPUs under the hood (e.g. llvmpipe) are also not supported. This is done on purpose to avoid accidentally using CPUs for accelerated inference. Let me know if you think this behavior should be changed.
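For the systemd route, a drop-in override along these lines should be enough; the drop-in path and file name are just examples, not something this PR installs:

# /etc/systemd/system/ollama.service.d/perfmon.conf (example path)
[Service]
AmbientCapabilities=CAP_PERFMON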
I've not tested this on Windows nor have I implemented the logic for building ollama with Vulkan support yet because I don't use Windows. If someone can help me with this that would be great.
I've tested this on my machine with an Intel Arc A770: