cpu type rpc worker by afazekas · Pull Request #1485 · containers/ramalama · GitHub

cpu type rpc worker #1485


Open

afazekas wants to merge 1 commit into containers/ramalama:main

Conversation

@afazekas (Contributor) commented Jun 9, 2025

Each RPC worker only supports one type of device;
if you want to use the remote machine's CPU/RAM resources as well, you need to run a CPU worker too.

This can also be useful when you have a GPU RPC worker on your node and you do not want your initiator process to use the GPU as well.
Another use case is when you have incompatible hardware and you want to bypass any issues with a partially working accelerator.

Example usage (the thread option is not passed, assuming the default is good).

CPU and CUDA nodes:

 ssh 192.168.141.7  podman run --rm --network host  -it quay.io/ramalama_rpc/cpu  /usr/bin/rpc-server -p 50053 -H 0.0.0.0 &
 ssh 192.168.142.5  podman run --rm --network host  -it quay.io/ramalama_rpc/cpu  /usr/bin/rpc-server -p 50053 -H 0.0.0.0 &
ssh 192.168.142.5  podman run --gpus=all --runtime /usr/bin/nvidia-container-runtime --network host quay.io/ramalama/cuda:0.9 /usr/bin/rpc-server -p 50052 -H 0.0.0.0  &
ssh 192.168.141.7  podman run --gpus=all --runtime /usr/bin/nvidia-container-runtime --network host quay.io/ramalama/cuda:0.9 /usr/bin/rpc-server -p 50052 -H 0.0.0.0  &

Initiator node with ROCm:

RAMALAMA_LLAMACPP_RPC_NODES=192.168.142.5:50053,192.168.140.7:50053,192.168.142.5:50052,192.168.140.7:50052  ramalama serve --ctx 8192 file:///srv/llm/modles/unsloth/Qwen3-235B-A22B-GGUF/Q8_0/Qwen3-235B-A22B-Q8_0-00001-of-00006.gguf  --ngl 62 --model-draft huggingface://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf:latest
srv    load_model: loading model '/mnt/models/Qwen3-235B-A22B-Q8_0-00001-of-00006.gguf'
llama_model_load_from_file_impl: using device RPC[192.168.142.5:50053] (RPC[192.168.142.5:50053]) - 64223 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.140.7:50053] (RPC[192.168.140.7:50053]) - 61852 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.142.5:50052] (RPC[192.168.142.5:50052]) - 23862 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.140.7:50052] (RPC[192.168.140.7:50052]) - 23872 MiB free
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 7900 XTX) - 24524 MiB free
llama_model_loader: additional 5 GGUFs metadata loaded.
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloaded 62/95 layers to GPU
load_tensors: RPC[192.168.142.5:50053] model buffer size = 52967.93 MiB
load_tensors: RPC[192.168.140.7:50053] model buffer size = 47923.36 MiB
load_tensors: RPC[192.168.142.5:50052] model buffer size = 17655.98 MiB
load_tensors: RPC[192.168.140.7:50052] model buffer size = 20178.26 MiB
load_tensors:   CPU_Mapped model buffer size = 47550.55 MiB
load_tensors:   CPU_Mapped model buffer size = 34423.68 MiB
load_tensors:        ROCm0 model buffer size = 17655.98 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
...

PS: the file:/// handling did not automatically internalize all the split files; I had to call it for each file. This should be fixed later.

Only a few tokens per second, but it loads a 235B model in Q8_0 across 3 HEDT machines.

sourcery-ai bot (Contributor) commented Jun 9, 2025

Reviewer's Guide

Adds support for CPU-only RPC workers by refactoring the build script’s CPU installation logic, introducing a new CPU containerfile, and updating build flags to use OpenBLAS.

File-Level Changes

Change: Refactor and generalize CPU installation in build script
Details:
  • Renamed dnf_install_s390 to dnf_install_cpu
  • Removed legacy s390/ppc comments and platform aliases
  • Added fallback to set BLAS_INCLUDE_DIRS when openblas.pc is missing (sketched below)
Files: container-images/scripts/build_llama_and_whisper.sh

Change: Enable CPU as a supported worker type
Details:
  • Added 'cpu' branch in dnf_install()
  • Introduced 'cpu' case in main() to append OpenBLAS flags
  • Created new Containerfile for CPU builds
Files: container-images/scripts/build_llama_and_whisper.sh, container-images/cpu/Containerfile
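
As a minimal sketch, assuming the fallback sits inside the renamed dnf_install_cpu(), the relevant part of the build script looks roughly like this (the snippet mirrors the lines quoted later in the review; the surrounding package installation is elided):

  dnf_install_cpu() {
    # ... package installation elided ...
    # fall back to explicit paths when openblas.pc is not installed
    if ! pkg-config --exists openblas; then
      export BLAS_INCLUDE_DIRS=/usr/include/openblas
      export BLAS_LIBRARY_DIRS=/usr/lib64
    fi
  }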


@sourcery-ai bot (Contributor) left a comment

Hey @afazekas - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@rhatdan (Member) commented Jun 9, 2025

@ericcurtin Thoughts?

Each rpc worker only supports one type of device
if you want to use the remote machines CPU/RAM resources
as well, you need to run cpu worker too.

This can be also useful in case you have a gpu rpc worker
on your node and you do not want your initiator process to
also use GPU.
Another use case in case you have an incompatible hardware
and you want to by pass any issue regarding to partially working
accelerator.

Signed-off-by: Attila Fazekas <afazekas@redhat.com>
@ericcurtin (Collaborator) commented Jun 9, 2025

I would just add the:

if ! pkg-config --exists openblas; then

lines to dnf_install_s390. We don't need a separate CPU inferencing container image. The "ramalama" image is intended for CPU inferencing. It's just duplication of scripting here. But what I'm trying to avoid more than this is having to build another type of container image when the reason we need it is unclear.
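
For reference, the existing image can already be started as an RPC worker with essentially the same invocation used for the cpu image above, just swapping the image name (the port here is arbitrary):

  podman run --rm --network host -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 -H 0.0.0.0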

@ericcurtin (Collaborator) commented Jun 9, 2025

I see:

dnf_install_s390

function was renamed to:

dnf_install_cpu

This is fine, but something like dnf_install_openblas would be better. We don't use OpenBLAS for all CPU-based inferencing; on most platforms we don't install it for CPU inferencing.

@rhatdan (Member) commented Jun 9, 2025

I agree we should avoid this at all costs, since I just spent a few hours building them.

@afazekas (Contributor, Author) commented Jun 9, 2025

I would just add the:

if ! pkg-config --exists openblas; then

lines to dnf_install_s390. We don't need a separate CPU inferencing container image. The "ramalama" image is intended for CPU inferencing. It's just duplication of scripting here. But what I'm trying to avoid more than this is having to build another type of container image when the reason we need it is unclear.

Looks like times have changed.
I retested with the ramalama/vulkan image; it seems it now has a '--threads' option and can fall back to 'cpu'.

podman run --rm --network host  -it quay.io/ramalama/ramalama:0.7  /usr/bin/rpc-server -p 50054 -H 0.0.0.0 --help
Usage: /usr/bin/rpc-server [options]

options:
  -h, --help                show this help message and exit
  -H HOST, --host HOST      host to bind to (default: 0.0.0.0)
  -p PORT, --port PORT      port to bind to (default: 50054)
  -m MEM,  --mem MEM        backend memory size (in MB)
  -c,      --cache          enable local file cache
podman run --rm --network host  -it quay.io/ramalama/ramalama:0.9  /usr/bin/rpc-server -p 50054 -H 0.0.0.0 --help
Usage: /usr/bin/rpc-server [options]

options:
  -h, --help                show this help message and exit
  -t,      --threads        number of threads for the CPU backend (default: 12)
  -d DEV,  --device         device to use
  -H HOST, --host HOST      host to bind to (default: 0.0.0.0)
  -p PORT, --port PORT      port to bind to (default: 50054)
  -m MEM,  --mem MEM        backend memory size (in MB)
  -c,      --cache          enable local file cache

@ericcurtin (Collaborator) commented:

The "vulkan" image and the "ramalama" image will eventually be merged, leaving just the "ramalama" image, so I'd focus on the "ramalama" image...

@rhatdan (Member) commented Jun 9, 2025

I stopped building the vulkan image a while ago. Only use ramalama for CPU inferencing. Although people are reporting performance issues with it right now.

@afazekas (Contributor, Author) commented Jun 9, 2025

Looks like passing 'CPU' or 'cpu' as the device is accepted and selects the CPU backend, and omitting the device flag is also accepted:

$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 --threads 8 -d foobar
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
error: unknown device: foobar
available devices:
Vulkan0: AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24560 MiB free)
CPU: AMD Ryzen 9 7900 12-Core Processor (128447 MiB, 128447 MiB free)
Invalid parameters
$ podman run --rm --network host --device=/dev/kfd --device=/dev/dri -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50053 --threads 12 -d CPU
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
create_backend: using CPU backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:50053
local cache : n/a
backend memory : 128447 MB
$ podman run --rm --network host -it quay.io/ramalama/ramalama:0.9 /usr/bin/rpc-server -p 50055
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = llvmpipe (LLVM 19.1.7, 256 bits) (llvmpipe) | uma: 0 | fp16: 1 | warp size: 8 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want.
create_backend: using Vulkan0 backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:50055
local cache : n/a
backend memory : 128447 MB

Testing whether it is actually working too.

@afazekas (Contributor, Author) commented Jun 9, 2025

Speed: 0.0 t/s on the UI; I still did not get a full response.
It was > 2 t/s with the draft model, ~1.8 t/s without.

Even though it starts up, it is very slow without the CPU container.

@afazekas (Contributor, Author) commented Jun 9, 2025

I assume ramalama is trying to use the CPU as a Vulkan device, which is significantly slower than the ggml software implementation for CPU; in non-rpc-server mode I cannot even select CPU as the device.

GPU0:
VkPhysicalDeviceProperties:

    apiVersion        = 1.4.305 (4210993)
    driverVersion     = 25.0.4 (104857604)
    vendorID          = 0x1002
    deviceID          = 0x744c
    deviceType        = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName        = AMD Radeon RX 7900 XTX (RADV NAVI31)
    pipelineCacheUUID = 961203c2-2e1b-831f-63f6-2a6f65afd585

GPU1:
VkPhysicalDeviceProperties:

    apiVersion        = 1.4.305 (4210993)
    driverVersion     = 0.0.1 (1)
    vendorID          = 0x10005
    deviceID          = 0x0000
    deviceType        = PHYSICAL_DEVICE_TYPE_CPU
    deviceName        = llvmpipe (LLVM 20.1.2, 256 bits)
    pipelineCacheUUID = 32352e30-2e34-6161-6161-616161616161

In order to use the CPU-optimised software instead of the CPU Vulkan driver, a dedicated image is still needed as far as I can see.

@ericcurtin (Collaborator) commented:

I assume ramalama is trying to use the CPU as a Vulkan device, which is significantly slower than the ggml software implementation for CPU; in non-rpc-server mode I cannot even select CPU as the device. [...] In order to use the CPU-optimised software instead of the CPU Vulkan driver, a dedicated image is still needed as far as I can see.

Yes, this is something we are looking into, but in the context of this PR, which is all about s390, I don't think we have to worry about that, as GGML_VULKAN isn't compiled in:

  ramalama)
    if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
      common_flags+=("-DGGML_VULKAN=ON")
    elif [ "$uname_m" = "s390x" ]; then
      common_flags+=("-DGGML_VXE=ON" "-DGGML_BLAS=ON" "-DGGML_BLAS_VENDOR=OpenBLAS")
    fi
    ;;

@taronaeo self-requested a review June 10, 2025 07:45
@taronaeo (Collaborator) commented:

Hi, correct me if I'm wrong, but as far as I can tell the changes can be merged into the ramalama image, right? All I'm seeing is a function rename from dnf_install_s390 to dnf_install_cpu with better OpenBLAS package detection.

@ericcurtin
We don't use openblas for all CPU-based inferencing, on most platforms we don't install it for CPU inferencing.

Wait, was there a reason why we omitted OpenBLAS for CPU-based inferencing (except Apple, because they use the Accelerate framework instead)? I remember doing benchmarks before and after OpenBLAS and there was quite an improvement, at least on the s390x architecture.

P.S., adding myself as a reviewer because it mentioned s390 ;)

@rhatdan (Member) commented Jun 10, 2025

Well, drop the extra cpu Containerfile, and then we can merge this, with CPU and RamaLama support being equivalent.

@ericcurtin (Collaborator) commented Jun 11, 2025

@taronaeo openblas is not commonly used for CPU inferencing from what I can see. But you guys found a performance advantage on s390, so happy to add it just for that architecture.

For x86/ARM we build VULKAN images that can also do CPU inferencing.

So to turn on openblas for non-s390 arches, we would have to prove it actually performs better across the board on those arches and plays nicely with vulkan being built in.

@ericcurtin (Collaborator) commented:

We can turn it on for powerpc if that's beneficial for you also
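
As a sketch, that would mean extending the existing flag selection in build_llama_and_whisper.sh along the lines of the s390x branch quoted above (the ppc64le uname value and the exact placement are assumptions):

    elif [ "$uname_m" = "ppc64le" ]; then
      common_flags+=("-DGGML_BLAS=ON" "-DGGML_BLAS_VENDOR=OpenBLAS")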

@ericcurtin (Collaborator) commented Jun 11, 2025

@taronaeo note that llama.cpp upstream does not turn on OpenBLAS for CPU inferencing in their containers:

https://github.com/ggml-org/llama.cpp/blob/master/.devops/cpu.Dockerfile

@taronaeo (Collaborator) commented Jun 11, 2025

@taronaeo openblas is not commonly used for CPU inferencing from what I can see. But you guys found a performance advantage on s390, so happy to add it just for that architecture.

For x86/ARM we build VULKAN images that can also do CPU inferencing.

So to turn on openblas for non-s390 arches, we would have to prove it actually performs better across the board on those arches and plays nicely with vulkan being built in.

Well, this genuinely stumped me. I did my own tests and you're right! It turns out AMD64 and ARM64 already include their own optimised GEMM and GEMV compute kernels with SIMD vector intrinsics that outperform OpenBLAS in prompt processing by roughly 51%.

I'll re-evaluate the performance difference between non-OpenBLAS and OpenBLAS for the s390x architecture to see if our previous benchmarks still hold the same performance difference. I learn something new every day, thank you for clarifying :)

Edit: I've re-evaluated and OpenBLAS still worked better on s390x :)
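
(For anyone reproducing the comparison, one way to do it is with llama.cpp's llama-bench against builds compiled with and without -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS; the model path and sizes below are only illustrative:)

  # compare prompt-processing and token-generation throughput of the two builds
  ./llama-bench -m /models/example-q8_0.gguf -p 512 -n 128 -t 12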

@taronaeo (Collaborator) commented:

Hi @afazekas, any update on this PR? :)

@ericcurtin (Collaborator) commented Jun 12, 2025

@taronaeo there's a tonne of backends in llama.cpp: generic CPU, OpenBLAS, ROCm, CUDA, Vulkan, MUSA, CANN... and these are just the ones RamaLama has enabled; there are more. It's an explicit goal of RamaLama to try to pick the best inferencing backend for your system.

Vulkan and generic CPU are actually the same container in RamaLama. Thanks to great work from @0cc4m, the automatic selection between Vulkan and generic CPU will improve in the next version of RamaLama.

@taronaeo (Collaborator) commented:

@taronaeo there's a tonne of backends in llama.cpp: generic CPU, OpenBLAS, ROCm, CUDA, Vulkan, MUSA, CANN... and these are just the ones RamaLama has enabled; there are more. It's an explicit goal of RamaLama to try to pick the best inferencing backend for your system.

Vulkan and generic CPU are actually the same container in RamaLama. Thanks to great work from @0cc4m, the automatic selection between Vulkan and generic CPU will improve in the next version of RamaLama.

Yeah, just wanted to check if this PR is still going through with

  1. Creating a separate CPU image (which is what this PR is doing)
  2. Activating OpenBLAS for all CPU-only backends

I'm looking more into point 2 because it affects performance for CPU-only compute. If OpenBLAS is enabled by default for AMD64 and ARM64, as seen in this PR at container-images/scripts/build_llama_and_whisper.sh:L335-L337, we will see a performance regression, as tested in my comment earlier.

So I was wondering if there is an update on the direction this PR is going; if it is still going through, we should only enable OpenBLAS for s390x.

@ericcurtin (Collaborator) commented Jun 12, 2025

These are the only lines worth considering IMO, if they actually fix something:

  if  ! pkg-config --exists openblas; then
    export BLAS_INCLUDE_DIRS=/usr/include/openblas
    export BLAS_LIBRARY_DIRS=/usr/lib64
  fi

Even the rename from s390 -> cpu isn't great, because OpenBLAS isn't the generic default CPU backend; it's s390-specific.

@@ -151,7 +155,7 @@ dnf_install() {
   if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
     dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
   else
-    dnf_install_s390
+    dnf_install_cpu
@ericcurtin (Collaborator) commented on the diff, Jun 12, 2025

Also, should we call this for ppc? I don't have a ppc system to test... The else suggests we should use OpenBLAS for ppc (and risc-v). We probably shouldn't have an else in general, because maybe someone wants risc-v soon, and I would rather test OpenBLAS on risc-v and ppc than blindly enable this non-default backend.

(Collaborator) commented:

In other words, I suggest:

elif [ "$uname_m" = "s390x" ]; then

to align with the compile-time option.
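
Putting the two review comments together, the suggested shape of the check would be roughly (a sketch; the rest of dnf_install() is elided):

  if [ "$uname_m" = "x86_64" ] || [ "$uname_m" = "aarch64" ]; then
    dnf_install_mesa # on x86_64 and aarch64 we use vulkan via mesa
  elif [ "$uname_m" = "s390x" ]; then
    dnf_install_s390 # only pull in OpenBLAS where it is known to help
  fi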
