cpu type rpc worker by afazekas · Pull Request #1485 · containers/ramalama · GitHub

cpu type rpc worker #1485


Open

afazekas wants to merge 1 commit into containers/ramalama:main

Conversation

@afazekas (Contributor) commented Jun 9, 2025

Each RPC worker only supports one type of device;
if you want to use the remote machine's CPU/RAM resources as well, you need to run a CPU worker too.

This can also be useful when you have a GPU RPC worker on your node and you do not want your initiator process to use the GPU as well.
Another use case is when you have incompatible hardware and you want to bypass any issues with a partially working accelerator.

Example usage (the thread option is not passed, assuming the default is good).

CPU and CUDA nodes:

 ssh 192.168.141.7  podman run --rm --network host  -it quay.io/ramalama_rpc/cpu  /usr/bin/rpc-server -p 50053 -H 0.0.0.0 &
 ssh 192.168.142.5  podman run --rm --network host  -it quay.io/ramalama_rpc/cpu  /usr/bin/rpc-server -p 50053 -H 0.0.0.0 &
ssh 192.168.142.5  podman run --gpus=all --runtime /usr/bin/nvidia-container-runtime --network host quay.io/ramalama/cuda:0.9 /usr/bin/rpc-server -p 50052 -H 0.0.0.0  &
ssh 192.168.141.7  podman run --gpus=all --runtime /usr/bin/nvidia-container-runtime --network host quay.io/ramalama/cuda:0.9 /usr/bin/rpc-server -p 50052 -H 0.0.0.0  &

Initiator node with ROCm:

RAMALAMA_LLAMACPP_RPC_NODES=192.168.142.5:50053,192.168.140.7:50053,192.168.142.5:50052,192.168.140.7:50052  ramalama serve --ctx 8192 file:///srv/llm/modles/unsloth/Qwen3-235B-A22B-GGUF/Q8_0/Qwen3-235B-A22B-Q8_0-00001-of-00006.gguf  --ngl 62 --model-draft huggingface://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf:latest
srv    load_model: loading model '/mnt/models/Qwen3-235B-A22B-Q8_0-00001-of-00006.gguf'
llama_model_load_from_file_impl: using device RPC[192.168.142.5:50053] (RPC[192.168.142.5:50053]) - 64223 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.140.7:50053] (RPC[192.168.140.7:50053]) - 61852 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.142.5:50052] (RPC[192.168.142.5:50052]) - 23862 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.140.7:50052] (RPC[192.168.140.7:50052]) - 23872 MiB free
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 7900 XTX) - 24524 MiB free
llama_model_loader: additional 5 GGUFs metadata loaded.
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloaded 62/95 layers to GPU
load_tensors: RPC[192.168.142.5:50053] model buffer size = 52967.93 MiB
load_tensors: RPC[192.168.140.7:50053] model buffer size = 47923.36 MiB
load_tensors: RPC[192.168.142.5:50052] model buffer size = 17655.98 MiB
load_tensors: RPC[192.168.140.7:50052] model buffer size = 20178.26 MiB
load_tensors:   CPU_Mapped model buffer size = 47550.55 MiB
load_tensors:   CPU_Mapped model buffer size = 34423.68 MiB
load_tensors:        ROCm0 model buffer size = 17655.98 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
...

PS: the file:/// handling did not automatically internalize all the split files; I had to call it for each file. This should be fixed later.

Only a few tokens per second, but it loads a 235B model in Q8_0 across 3 HEDT machines.

sourcery-ai bot (Contributor) commented Jun 9, 2025

Reviewer's Guide

Adds support for CPU-only RPC workers by refactoring the build script’s CPU installation logic, introducing a new CPU containerfile, and updating build flags to use OpenBLAS.

File-Level Changes

Change: Refactor and generalize CPU installation in build script
Details:
  • Renamed dnf_install_s390 to dnf_install_cpu
  • Removed legacy s390/ppc comments and platform aliases
  • Added fallback to set BLAS_INCLUDE_DIRS when openblas.pc is missing (sketched below)
Files: container-images/scripts/build_llama_and_whisper.sh

Change: Enable CPU as a supported worker type
Details:
  • Added 'cpu' branch in dnf_install()
  • Introduced 'cpu' case in main() to append OpenBLAS flags
  • Created new Containerfile for CPU builds
Files: container-images/scripts/build_llama_and_whisper.sh, container-images/cpu/Containerfile
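
As a minimal sketch, assuming the fallback sits inside the renamed dnf_install_cpu(), the relevant part of the build script looks roughly like this (the snippet mirrors the lines quoted later in the review; the surrounding package installation is elided):

  dnf_install_cpu() {
    # ... package installation elided ...
    # fall back to explicit paths when openblas.pc is not installed
    if ! pkg-config --exists openblas; then
      export BLAS_INCLUDE_DIRS=/usr/include/openblas
      export BLAS_LIBRARY_DIRS=/usr/lib64
    fi
  }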


@sourcery-ai bot (Contributor) left a comment

Hey @afazekas - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@rhatdan (Member) commented Jun 9, 2025

@ericcurtin Thoughts?

Each rpc worker only supports one type of device
if you want to use the remote machines CPU/RAM resources
as well, you need to run cpu worker too.

This can be also useful in case you have a gpu rpc worker
on your node and you do not want your initiator process to
also use GPU.
Another use case in case you have an incompatible hardware
and you want to by pass any issue regarding to partially working
accelerator.

Signed-off-by: Attila Fazekas <afazekas@redhat.com>
@ericcurtin (Collaborator) commented Jun 9, 2025

I would just add the:

if ! pkg-config --exists openblas; then

lines to dnf_install_s390. We don't need a separate CPU inferencing container image. The "ramalama" image is intended for CPU inferencing. It's just duplication of scripting here. But what I'm trying to avoid more than this is having to build another type of container image when the reason we need it is unclear.
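
For reference, the existing image can already be started as an RPC worker with essentially the same invocation used for the cpu image above, just swapping the image name (the port here is arbitrary):

  podman run --rm --network host -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 -H 0.0.0.0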

@ericcurtin (Collaborator) commented Jun 9, 2025

I see:

dnf_install_s390

function was renamed to:

dnf_install_cpu

This is fine, but something like dnf_install_openblas would be better. We don't use OpenBLAS for all CPU-based inferencing; on most platforms we don't install it for CPU inferencing.

@rhatdan (Member) commented Jun 9, 2025

I agree we should avoid this at all costs, since I just spent a few hours building them.

@afazekas (Contributor, Author) commented Jun 9, 2025

I would just add the:

if ! pkg-config --exists openblas; then

lines to dnf_install_s390. We don't need a separate CPU inferencing container image. The "ramalama" image is intended for CPU inferencing. It's just duplication of scripting here. But what I'm trying to avoid more than this is having to build another type of container image when the reason we need it is unclear.

Looks like times have changed.
I retested with the ramalama/vulkan image; it seems it now has a '--threads' option and can fall back to 'cpu'.

podman run --rm --network host  -it quay.io/ramalama/ramalama:0.7  /usr/bin/rpc-server -p 50054 -H 0.0.0.0 --help
Usage: /usr/bin/rpc-server [options]

options:
  -h, --help                show this help message and exit
  -H HOST, --host HOST      host to bind to (default: 0.0.0.0)
  -p PORT, --port PORT      port to bind to (default: 50054)
  -m MEM,  --mem MEM        backend memory size (in MB)
  -c,      --cache          enable local file cache
podman run --rm --network host  -it quay.io/ramalama/ramalama:0.9  /usr/bin/rpc-server -p 50054 -H 0.0.0.0 --help
Usage: /usr/bin/rpc-server [options]

options:
  -h, --help                show this help message and exit
  -t,      --threads        number of threads for the CPU backend (default: 12)
  -d DEV,  --device         device to use
  -H HOST, --host HOST      host to bind to (default: 0.0.0.0)
  -p PORT, --port PORT      port to bind to (default: 50054)
  -m MEM,  --mem MEM        backend memory size (in MB)
  -c,      --cache          enable local file cache

@ericcurtin (Collaborator) commented:

The "vulkan" image and the "ramalama" image will eventually be merged, leaving just the "ramalama" image, so I'd focus on the "ramalama" image...

@rhatdan (Member) commented Jun 9, 2025

I stopped building the vulkan image a while ago. Only use ramalama for CPU inferencing. Although people are reporting performance issues with it right now.

@afazekas (Contributor, Author) commented Jun 9, 2025

Looks like passing 'CPU' or 'cpu' as the device is accepted and selects the CPU backend, and omitting the device flag is also accepted:

$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 --threads 8 -d foobar
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
error: unknown device: foobar
available devices:
Vulkan0: AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24560 MiB free)
CPU: AMD Ryzen 9 7900 12-Core Processor (128447 MiB, 128447 MiB free)
Invalid parameters
$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 --threads 12 -d CPU
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
create_backend: using CPU backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:50053
local cache : n/a
backend memory : 128447 MB
$ podman run --rm --network host -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50055
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = llvmpipe (LLVM 19.1.7, 256 bits) (llvmpipe) | uma: 0 | fp16: 1 | warp size: 8 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want.
create_backend: using Vulkan0 backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:50055
local cache : n/a
backend memory : 128447 MB

Testing whether it is actually working too.

@afazekas (Contributor, Author) commented Jun 9, 2025

Speed: 0.0 t/s on the UI; I still did not get a full response.
It was > 2 t/s with the draft model, ~1.8 t/s without.

Even though it starts up, it is very slow without the CPU container.

@afazekas (Contributor, Author) commented Jun 9, 2025

I assume ramalama is trying to use the CPU as a Vulkan device, which is significantly slower than the ggml software implementation for CPU; in non-rpc-server mode I cannot even select CPU as the device.

GPU0:
VkPhysicalDeviceProperties:

    apiVersion        = 1.4.305 (4210993)
    driverVersion     = 25.0.4 (104857604)
    vendorID          = 0x1002
    deviceID          = 0x744c
    deviceType        = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName        = AMD Radeon RX 7900 XTX (RADV NAVI31)
    pipelineCacheUUID = 961203c2-2e1b-831f-63f6-2a6f65afd585

GPU1:
VkPhysicalDeviceProperties:

    apiVersion        = 1.4.305 (4210993)
    driverVersion     = 0.0.1 (1)
    vendorID          = 0x10005
    deviceID          = 0x0000
    deviceType        = PHYSICAL_DEVICE_TYPE_CPU
    deviceName        = llvmpipe (LLVM 20.1.2, 256 bits)
    pipelineCacheUUID = 32352e30-2e34-6161-6161-616161616161

In order to use the CPU-optimised software instead of the CPU Vulkan driver, a dedicated image is still needed as far as I can see.

@ericcurtin (Collaborator) commented:

I assume ramalama is trying to use the CPU as a Vulkan device, which is significantly slower than the ggml software implementation for CPU; in non-rpc-server mode I cannot even select CPU as the device. [...] In order to use the CPU-optimised software instead of the CPU Vulkan driver, a dedicated image is still needed as far as I can see.

Yes, this is something we are looking into, but in the context of this PR, which is all about s390, I don't think we have to worry about that, as GGML_VULKAN isn't compiled in:

  ramalama)
    if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
      common_flags+=("-DGGML_VULKAN=ON")
    elif [ "$uname_m" = "s390x" ]; then
      common_flags+=("-DGGML_VXE=ON" "-DGGML_BLAS=ON" "-DGGML_BLAS_VENDOR=OpenBLAS")
    fi
    ;;

@taronaeo self-requested a review June 10, 2025 07:45
@taronaeo (Collaborator) commented:

Hi, correct me if I'm wrong, but as far as I can tell the changes can be merged into the ramalama image, right? All I'm seeing is a function rename from dnf_install_s390 to dnf_install_cpu with better OpenBLAS package detection.

@ericcurtin
We don't use openblas for all CPU-based inferencing, on most platforms we don't install it for CPU inferencing.

Wait, was there a reason why we omitted OpenBLAS for CPU-based inferencing (except Apple, because they use the Accelerate framework instead)? I remember doing benchmarks before and after OpenBLAS and there was quite an improvement, at least on the s390x architecture.

P.S., adding myself as a reviewer because it mentioned s390 ;)

@rhatdan (Member) commented Jun 10, 2025

Well, drop the extra cpu Containerfile, and then we can merge this, with CPU and RamaLama support being equivalent.

@ericcurtin (Collaborator) commented Jun 11, 2025

@taronaeo openblas is not commonly used for CPU inferencing from what I can see. But you guys found a performance advantage on s390, so happy to add it just for that architecture.

For x86/ARM we build VULKAN images that can also do CPU inferencing.

So to turn on openblas for non-s390 arches, we would have to prove it actually performs better across the board on those arches and plays nicely with vulkan being built in.

@ericcurtin (Collaborator) commented:

We can turn it on for powerpc if that's beneficial for you also
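
As a sketch, that would mean extending the existing flag selection in build_llama_and_whisper.sh along the lines of the s390x branch quoted above (the ppc64le uname value and the exact placement are assumptions):

    elif [ "$uname_m" = "ppc64le" ]; then
      common_flags+=("-DGGML_BLAS=ON" "-DGGML_BLAS_VENDOR=OpenBLAS")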

@ericcurtin (Collaborator) commented Jun 11, 2025

@taronaeo note that llama.cpp upstream does not turn on OpenBLAS for CPU inferencing in their containers:

https://github.com/ggml-org/llama.cpp/blob/master/.devops/cpu.Dockerfile

@taronaeo (Collaborator) commented Jun 11, 2025

@taronaeo openblas is not commonly used for CPU inferencing from what I can see. But you guys found a performance advantage on s390, so happy to add it just for that architecture.

For x86/ARM we build VULKAN images that can also do CPU inferencing.

So to turn on openblas for non-s390 arches, we would have to prove it actually performs better across the board on those arches and plays nicely with vulkan being built in.

Well, this genuinely stumped me. I did my own tests and you're right! It turns out AMD64 and ARM64 already include their own optimised GEMM and GEMV compute kernels with SIMD vector intrinsics that outperform OpenBLAS in prompt processing by roughly 51%.

I'll re-evaluate the performance difference between non-OpenBLAS and OpenBLAS for the s390x architecture to see if our previous benchmarks still hold the same performance difference. I learn something new every day, thank you for clarifying :)

Edit: I've re-evaluated and OpenBLAS still worked better on s390x :)
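
(For anyone reproducing the comparison, one way to do it is with llama.cpp's llama-bench against builds compiled with and without -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS; the model path and sizes below are only illustrative:)

  # compare prompt-processing and token-generation throughput of the two builds
  ./llama-bench -m /models/example-q8_0.gguf -p 512 -n 128 -t 12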

@taronaeo (Collaborator) commented:

Hi @afazekas, any update on this PR? :)

@ericcurtin (Collaborator) commented Jun 12, 2025

@taronaeo there's a tonne of backends in llama.cpp: generic CPU, OpenBLAS, ROCm, CUDA, Vulkan, MUSA, CANN... and these are just the ones RamaLama has enabled; there are more. It's an explicit goal of RamaLama to try to pick the best inferencing backend for your system.

Vulkan and generic CPU are actually the same container in RamaLama. Thanks to great work from @0cc4m, the automatic selection between Vulkan and generic CPU will improve in the next version of RamaLama.

@taronaeo (Collaborator) commented:

@taronaeo there's a tonne of backends in llama.cpp: generic CPU, OpenBLAS, ROCm, CUDA, Vulkan, MUSA, CANN... and these are just the ones RamaLama has enabled; there are more. It's an explicit goal of RamaLama to try to pick the best inferencing backend for your system.

Vulkan and generic CPU are actually the same container in RamaLama. Thanks to great work from @0cc4m, the automatic selection between Vulkan and generic CPU will improve in the next version of RamaLama.

Yeah, just wanted to check if this PR is still going through with

  1. Creating a separate CPU image (which is what this PR is doing)
  2. Activating OpenBLAS for all CPU-only backends

I'm looking more into point 2 because it affects performance for CPU-only compute. If OpenBLAS is enabled by default for AMD64 and ARM64, as seen in this PR at container-images/scripts/build_llama_and_whisper.sh:L335-L337, we will see a performance regression, as tested in my comment earlier.

So I was wondering if there is an update on the direction this PR is going; if it is still going through, we should only enable OpenBLAS for s390x.

@ericcurtin (Collaborator) commented Jun 12, 2025

These are the only lines worth considering IMO, if they actually fix something:

  if  ! pkg-config --exists openblas; then
    export BLAS_INCLUDE_DIRS=/usr/include/openblas
    export BLAS_LIBRARY_DIRS=/usr/lib64
  fi

Even the rename from s390 -> cpu isn't great, because OpenBLAS isn't the generic default CPU backend; it's s390-specific.

@@ -151,7 +155,7 @@ dnf_install() {
   if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
     dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
   else
-    dnf_install_s390
+    dnf_install_cpu
@ericcurtin (Collaborator) commented on the diff, Jun 12, 2025

Also, should we call this for ppc? I don't have a ppc system to test... The else suggests we should use OpenBLAS for ppc (and risc-v). We probably shouldn't have an else in general, because maybe someone wants risc-v soon, and I would rather test OpenBLAS on risc-v and ppc than blindly enable this non-default backend.

(Collaborator) commented:

In other words, I suggest:

elif [ "$uname_m" = "s390x" ]; then

to align with the compile-time option.
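
Putting the two review comments together, the suggested shape of the check would be roughly (a sketch; the rest of dnf_install() is elided):

  if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
    dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
  elif [ "$uname_m" = "s390x" ]; then
    dnf_install_s390 # only pull in OpenBLAS where it is known to help
  fi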
