cpu type rpc worker #1485
Conversation
Reviewer's Guide: Adds support for CPU-only RPC workers by refactoring the build script's CPU installation logic, introducing a new CPU containerfile, and updating build flags to use OpenBLAS.
Hey @afazekas - I've reviewed your changes and they look great!
Here's what I looked at during the review:
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
@ericcurtin Thoughts?
Each RPC worker only supports one type of device. If you want to use the remote machine's CPU/RAM resources as well, you need to run a CPU worker too. This can also be useful when you have a GPU RPC worker on your node and you do not want your initiator process to also use the GPU. Another use case: you have incompatible hardware and want to bypass any issue with a partially working accelerator. Signed-off-by: Attila Fazekas <afazekas@redhat.com>
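For illustration (a sketch, not taken from the PR — the ports and image tag are assumptions; the rpc-server flags mirror the invocation tested later in this thread):

# GPU RPC worker on the remote node
$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50052

# additional CPU-only RPC worker on the same node, exposing its CPU/RAM as well
$ podman run --rm --network host quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 -d cpu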
I would just add the:
lines to dnf_install_s390. We don't need a separate CPU inferencing container image; the "ramalama" image is intended for CPU inferencing. This is just duplicated scripting. But what I'm trying to avoid, more than the duplication, is having to build another type of container image when the reason we need it is unclear.
I see the dnf_install_s390 function was renamed to dnf_install_cpu. This is fine, but something like dnf_install_openblas would be better. We don't use OpenBLAS for all CPU-based inferencing; on most platforms we don't install it for CPU inferencing at all.
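A minimal sketch of the suggested naming (hypothetical — openblas-devel as the package name is an assumption; the real package list lives in the existing build script):

dnf_install_openblas() {
  # s390x-specific: install OpenBLAS so llama.cpp can link a BLAS backend
  dnf install -y openblas-devel
}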
I agree we should avoid this at all costs, since I just spent a few hours building them.
Looks like times have changed,
The "vulkan" image and the "ramalama" image will eventually be merged, just leaving the "ramalama" image, I'd focus on the "ramalama" image... |
I stopped building the vulkan image a while ago. Only use ramalama for CPU inferencing. Although people are reporting performance issues with it right now.
Looks like passing 'CPU' or 'cpu' as the device is accepted as CPU, as is excluding the device flag entirely:
$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 --threads 8 -d foobar
Testing whether it actually works, too.
Speed: 0.0 t/s. On the UI I still did not get a full response. Even though it starts up, it is very slow without the CPU container.
I assume ramalama is trying to use the CPU as a Vulkan device, which is significantly slower than the ggml software implementation for CPU; in non-rpc-server mode I cannot even select the CPU as a device. GPU0:
Yes, this is something we are looking into, but in the context of this PR, which is all about s390, I don't think we have to worry about that, as GGML_VULKAN isn't compiled in:
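The evidence wasn't preserved above; one runtime way to confirm (a sketch — the --list-devices option assumes a recent llama.cpp build, and the binary name inside the image is an assumption). If GGML_VULKAN were compiled in, Vulkan devices would show up in this listing:

$ podman run --rm quay.io/ramalama/ramalama:0.9 llama-server --list-devices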
Hi, correct me if I'm wrong, but as far as I can tell the changes can be merged together into the
Wait, was there a reason why we omitted OpenBLAS for CPU-based inferencing (except Apple, because they use the Accelerate framework instead)? I remember doing benchmarks before-and-after OpenBLAS and there was quite an improvement, at least on the
P.S., adding myself as a reviewer because it mentioned
Well, drop the extra CPU containerfile, and then we can merge this, with CPU and RamaLama support being equivalent.
@taronaeo OpenBLAS is not commonly used for CPU inferencing from what I can see, but you guys found a performance advantage on s390, so I'm happy to add it just for that architecture. For x86/ARM we build Vulkan images that can also do CPU inferencing. So to turn on OpenBLAS for non-s390 arches, we would have to prove it actually performs better across the board on those arches and plays nicely with Vulkan being built in.
We can turn it on for powerpc if that's beneficial for you also.
@taronaeo note that llama.cpp upstream does not turn on OpenBLAS for CPU inferencing in their containers: https://github.com/ggml-org/llama.cpp/blob/master/.devops/cpu.Dockerfile
Well, this genuinely stumped me. I did my own tests and you're right! It turns out AMD64 and ARM64 already include their own optimised GEMM and GEMV compute kernels with SIMD vector intrinsics that outperform OpenBLAS in prompt processing by roughly 51%. I'll re-evaluate the performance difference between non-OpenBLAS and OpenBLAS again for
Edit: I've re-evaluated, and OpenBLAS still worked better on
Hi @afazekas, any update on this PR? :)
@taronaeo there's a tonne of backends in llama.cpp: generic CPU, OpenBLAS, ROCm, CUDA, Vulkan, MUSA, CANN... and these are just the ones RamaLama has enabled; there are more. It's an explicit goal of RamaLama to try to pick the best inferencing backend for your system. Vulkan and generic CPU actually share the same container in RamaLama. Thanks to great work from @0cc4m, the automatic selection between Vulkan and generic CPU will improve in the next version of RamaLama.
Yeah, just wanted to check if this PR is still going through with
I'm looking more into point 2 because it affects performance for CPU-only compute if OpenBLAS is enabled by default for AMD64 and ARM64, as seen in this PR.
So I was wondering if there is an update on the direction this PR is going; if it is still going through, we should only enable OpenBLAS for s390x.
These are the only lines worth considering IMO, if they actually fix something:
Even the rename from s390 -> cpu isn't great, because OpenBLAS isn't the generic default CPU backend; it's s390-specific.
@@ -151,7 +155,7 @@ dnf_install() {
   if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
     dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
   else
-    dnf_install_s390
+    dnf_install_cpu
Also, should we call this for ppc? I don't have a ppc system to test... The else suggests we should use OpenBLAS for ppc (and risc-v). We probably shouldn't have an else in general, because maybe someone will want risc-v soon, and I would rather test OpenBLAS on risc-v and ppc than blindly enable this non-default backend.
In other words, I suggest:
elif [ "$uname_m" = "s390x" ]; then
to align with the compile-time option.
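Spelled out against the hunk above, the suggestion might look like this (a sketch — dnf_install_openblas assumes the rename proposed earlier; the other names come from the existing build script):

if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
  dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
elif [ "$uname_m" = "s390x" ]; then
  dnf_install_openblas # only s390x gets the non-default OpenBLAS backend
fi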
Example usage:
The thread option is not passed, assuming the default is good.
cpu and cuda nodes:
initiator node with rocm:
(a sketch of such commands follows below)
PS: the file:/// did not automatically internalize all the split files; I had to call it for each file. This should be fixed later.
Only a few tokens per second, but it loads a 235B model in q8_0 on 3 HEDT machines.
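The exact commands weren't preserved above; a minimal sketch of what such a setup might look like (assumptions: llama.cpp's rpc-server -p/-d flags as used earlier in this thread, llama-server's --rpc option, and illustrative hostnames, ports, image tags, and model filename):

# CPU worker and CUDA worker on the remote nodes
$ podman run --rm --network host quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50052 -d cpu
$ podman run --rm --network host --device nvidia.com/gpu=all quay.io/ramalama/cuda:0.9 /usr/bin/rpc-server -p 50052

# initiator node with rocm, offloading across both workers
$ llama-server -m Qwen3-235B-A22B-Q8_0.gguf --rpc cpu-node:50052,cuda-node:50052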