discover/gpu.go: Add Support for Distributed Inferencing (continued) #10844
Conversation
To clarify: the exact same container built from this PR starts up with GPUs just fine in its default mode, but not when run in rpc mode. That said ... it does find multiple RPC server backends and their general-purpose memory 😄
@gkpln3 (and probably @rick-github + team, given the potential visibility of this track) - the team at WhiteFiber have graciously offered a couple of 8-way B200 hosts and possibly some H200s on various fabrics to help test once we get the basics going (the people who run the place built what was probably the first GPUaaS cloud; they're very FOSS-friendly), so we may have a chance to see just how far this rocket ship goes in practical terms. I know that RPC isn't exactly as efficient as zero-copy RDMA, but 4 Tbit of interconnect is nothing at which to scoff (all 400G cards, 2x for the front-ends, 8 paired w/ GPUs; the fabrics are designed and implemented converged for maximum flexibility - this use-case being a good example of "why").
This is amazing! Thanks!
Can you try running this?
It should print out all the devices it's recognizing.
It is not happy doing that :-\
with no further output. In case it's relevant, the backing devices are 4x 32G V100s on SXM2, so v7 compute capability, and they obviously work on just
@asaiacai is spinning up and prepping the B200s; we'll run a single interface @ 400G for the socket test for the time being and can work out the multi-NIC thing and RDMA down the line, but I imagine everyone's curious about the implications of chaining hardware together even in such a user-friendly modality. @gkpln3 if you have any specific models, configurations, etc. you'd like set up for the test, please feel free to ask @asaiacai as we go. Once we have the code built there we should also quickly be able to tell if this is an artifact of the stack on which I am testing, though I do think that what I am seeing re "no GPUs" is a bug. :-)
Got it, let me check about the GPU identification thing.
Rebuilding :-)
Ok, I think I found the issue. Seems like I wasn't initializing the ggml backend properly when loading the RPC server. Let's see :)
No dice on 84aa6d0 unfortunately, @gkpln3:
$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --device list --port 50053
2025/05/26 19:26:07 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:26:07 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:26:07 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
error: unknown device: list
available devices:
^C2025/05/26 19:26:22 rpc_server.go:44: Shutting down RPC server...
^C
$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --port 50053
2025/05/26 19:31:53 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:31:53 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:31:53 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
create_backend: using CPU backend
Any chance that Docker skipped the relevant build step?
I think not, the relevant one is:
and it seems like it ran it.
Can confirm that passing
Will start digging into the code this evening if the bug remains elusive; and if the same issue is apparent on the B200s, then @asaiacai should have an independent build & test stack up on that hardware shortly, so we can validate whether this is confined to a specific HW/compatibility rev.
Seeing similar errors on my
@sempervictus I'm pretty sure it's related to the libraries that ollama passes to the runner process that we're not getting in the
I can keep working on that later; I have to go for now.
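A minimal sketch of the hypothesis above, assuming the missing piece is the library-path environment that the runner subprocess normally receives: spawn the RPC server with that environment forwarded so the CUDA ggml backend can be loaded instead of falling back to CPU. The library directory and flag values here are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Assumption: the bundled CUDA/ggml backend libraries live in a directory
	// like this one; in practice it would be whatever ollama passes to its
	// runner subprocesses.
	libDir := "/usr/lib/ollama/cuda_v12"

	// Launch the RPC server with the runner-style environment forwarded.
	cmd := exec.Command("ollama", "rpc", "--port", "50053")
	cmd.Env = append(os.Environ(), "LD_LIBRARY_PATH="+libDir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "rpc server exited:", err)
	}
}
```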
@gkpln3 - apologies for the lag, had to step out for a bit. In
I tried feeding the "relevant-seeming" ones into the
@sempervictus I think the env vars you sent belong to the
Take the
Apologies, here's the subprocess output for a codegeex4 runner:
Interestingly, it specifies
Can you try running rpc with these env vars?
Also, I suggest you remove the env vars from the comment; it contains some additional values that shouldn't be public 😅
Appreciate the heads up - those are default values in a compose file (from a GH repo somewhere, IIRC) which only exist if we don't supply overriding env vars when bringing the stack up. Ours are fed by devops when used in prod, so those aren't in actual use :-). On the v12 forced-run attempt - same effect. Cleared all ollama images, rebuilt the local one from this PR, and still get:
@sempervictus Thanks for the output; seems like it will require some extra investigation 😅
I found that if I clone llama.cpp directly, check out the commit in Makefile.sync, and then build the RPC server, it doesn't seem to consistently work with this Ollama - but at least it does sometimes work 🎉 I get errors like this:
The target device has 24 GB of VRAM. It does seem to work when I use "qwen3:235b-a22b-q4_K_M". It's quite exciting when it does work 😄
@aquarat Do you experience the same issue as @sempervictus where it doesn't recognize the GPUs?
@sempervictus I just realized that we might have made a small mistake with our env vars tests. Let's give it another try with this:
The GPUs get detected now after adjusting the env vars under
When I specify the device as
@asaiacai Unfortunately, the RPC currently emulates a single GPU per instance, so multi-GPU support isn't available out of the box. You can run a separate RPC server for each GPU and connect them all, but it's not the most straightforward setup. Hopefully, we'll implement a better solution in the future.
Since we're wrapping the RPC anyway, this is probably the layer at which that logic would make the most sense to implement. If doing just a discovery-and-init wrapper, probably something along the lines of enumerating GPUs, assigning each one a port incrementing from the base, and then printing that for consumers to pick up... I'm sure we can automate that by explicit device definition at call time as well, but figure it might be "slicker" if actually handled by the bin.
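As a rough illustration of that wrapper idea (assumptions: GPUs are counted via nvidia-smi -L, devices are named CUDA0..CUDAn-1, and the per-device flags are the --device/--port ones used later in this thread; this is not code from the PR):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	const basePort = 50053 // starting port; each GPU gets base+index

	// Count local GPUs: nvidia-smi -L prints one "GPU N: ..." line per device.
	out, err := exec.Command("nvidia-smi", "-L").Output()
	if err != nil {
		fmt.Println("could not enumerate GPUs:", err)
		return
	}
	var gpus int
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.HasPrefix(line, "GPU ") {
			gpus++
		}
	}

	// Start one RPC server per GPU on an incrementing port.
	var endpoints []string
	for i := 0; i < gpus; i++ {
		port := basePort + i
		cmd := exec.Command("ollama", "rpc",
			"--device", fmt.Sprintf("CUDA%d", i),
			"--port", strconv.Itoa(port))
		if err := cmd.Start(); err != nil {
			fmt.Printf("failed to start RPC server for CUDA%d: %v\n", i, err)
			continue
		}
		endpoints = append(endpoints, "0.0.0.0:"+strconv.Itoa(port))
	}

	// Print the endpoint list for consumers (e.g. an aggregating front-end).
	fmt.Println(strings.Join(endpoints, ","))
}
```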
@gkpln3 Using individual RPC servers and feeding those to the
distribution does occur, but between CPU and RPC rather than between local and remote GPUs. Should we have each card in the SXM hosts on its own RPC server, with a single front-end aggregating those? Is there a way to push more layers to the RPC targets?
Separately - what's the correct compose invocation to run these one-off RPC instances?
throws
Seems the RPC services are somewhat unstable -
... local GPUs
No error logs in the RPC server output or
Try enabling the debug logs by setting
Seems like docker-compose is passing everything as a single string. Try splitting it up like:
command:
  - rpc
  - --device
  - CUDA0
  - --port
  - "51020"
This PR builds on top of the work done by @ecyht2 on #6729, following issue #4643.
It aims to add RPC support to Ollama based on llama.cpp's RPC mechanism, to allow distributed inference across multiple devices.
This PR has been tested and confirmed working on macOS (fixing a race condition in distributed inference). Best performance can be achieved by connecting the devices using Thunderbolt 4.
This PR also adds the `ollama rpc` command, which allows running the RPC server on another computer.
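For illustration only, a minimal sketch of how such a subcommand might be wired up with cobra (which the Ollama CLI uses). The --port and --device flags mirror the ones exercised earlier in this thread; everything else, including the default values and the server startup itself, is a placeholder rather than the PR's actual implementation.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var (
		port   int
		device string
	)

	rpcCmd := &cobra.Command{
		Use:   "rpc",
		Short: "Run a ggml RPC backend server for distributed inference",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Placeholder: initialize the ggml backend for the requested
			// device and serve it over the llama.cpp RPC protocol.
			fmt.Printf("starting RPC server for %q on 0.0.0.0:%d\n", device, port)
			return nil
		},
	}
	rpcCmd.Flags().IntVar(&port, "port", 50052, "port to listen on (placeholder default)")
	rpcCmd.Flags().StringVar(&device, "device", "CUDA0", "backend device to expose (placeholder default)")

	if err := rpcCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```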