
discover/gpu.go: Add Support for Distributed Inferencing (continued) #10844

Open · gkpln3 wants to merge 37 commits into ollama:main

Conversation

@gkpln3 commented May 24, 2025

This PR builds on top of the work done by @ecyht2 in #6729, following issue #4643.
It aims to add RPC support to Ollama based on llama.cpp's RPC mechanism, allowing distributed inference across multiple devices.

This PR has been tested and confirmed working on macOS (including a fix for a race condition in distributed inference). Best performance can be achieved by connecting the devices over Thunderbolt 4.

This PR also adds the ollama rpc command, which allows running the RPC server on another computer.
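
For reference, a minimal usage sketch; the host address and port are placeholders, and the OLLAMA_RPC_SERVERS format (a comma-separated list of host:port entries) is an assumption based on the config key that appears in the server logs later in this thread:

# on the remote machine: expose it over RPC (experimental; do not expose to open networks)
ollama rpc --port 50052

# on the machine serving API requests: point Ollama at the remote RPC server(s)
# (assumed format: comma-separated host:port list)
OLLAMA_RPC_SERVERS=10.0.0.2:50052 ollama serve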

ecyht2 and others added 30 commits October 13, 2024 09:46
@sempervictus commented May 26, 2025

To clarify: the exact same container built from this PR starts up with GPUs just fine in its default mode, but in rpc mode it presents only CPU resources. The same container running as a client of the RPC server likewise detects only CPU resources.

That said, it does find multiple RPC server backends and their general-purpose memory 😄

@sempervictus commented May 26, 2025

@gkpln3 (and probably @rick-github + team given the potential visibility of this track) - the team at WhiteFiber have graciously offered a couple of 8-way B200 hosts and possibly some H200s on various fabrics to help test once we get the basics going (the people who run the place built what was probably the first GPUaaS cloud, they're very FOSS-friendly) so we may have a chance to see just how far this rocket ship goes in practical terms. I know that RPC isn't exactly as efficient as zero-copy RDMA, but 4 Tbit of interconnect is nothing at which to scoff (all 400G cards, 2x for the front-ends, 8 paired w/ GPUs: the fabrics are designed and implemented converged for maximum flexibility - this use-case being a good example of "why").

@gkpln3 (Author) commented May 26, 2025

@gkpln3 (and probably @rick-github + team given the potential visibility of this track) - the team at WhiteFiber have graciously offered a couple of 8-way B200 hosts and possibly some H200s on various fabrics to help test once we get the basics going (the people who run the place built what was probably the first GPUaaS cloud, they're very FOSS-friendly) so we may have a chance to see just how far this rocket ship goes in practical terms. I know that RPC isnt exactly as efficient as zero-copy RDMA but 4Tbit of interconnect is nothing at which to scoff (all 400G cards, 2x for the front-ends, 8 paired w/ GPUs: the fabrics are designed and implemented converged for maximum flexibility - this use-case being a good example of "why").

This is amazing! Thanks!

To clarify: same exact container built from this PR starts with GPUs just fine in its default mode but in rpc ... it only presents CPU resources. The same container running as a client of the RPC only detects CPU resources available.

That said ... it does find multiple RPC server backend and their general-purpose memory 😄

Can you try running this?

ollama rpc --device list

It should print out all the devices it's recognizing.

@sempervictus commented May 26, 2025

It is not happy doing that :-\

 docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --device list --port 50053
2025/05/26 18:26:03 rpc_server.go:25: Starting RPC server on 0.0.0.0:50053
2025/05/26 18:26:03 rpc_server.go:34: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:


with no further output.

In case it's relevant, the backing devices are 4x 32 GB V100s on SXM2 (so compute capability 7.0), and they obviously work with a plain ollama serve invocation.

@sempervictus

@gkpln3 (and probably @rick-github + team given the potential visibility of this track) - the team at WhiteFiber have graciously offered a couple of 8-way B200 hosts and possibly some H200s on various fabrics to help test once we get the basics going (the people who run the place built what was probably the first GPUaaS cloud, they're very FOSS-friendly) so we may have a chance to see just how far this rocket ship goes in practical terms. I know that RPC isnt exactly as efficient as zero-copy RDMA but 4Tbit of interconnect is nothing at which to scoff (all 400G cards, 2x for the front-ends, 8 paired w/ GPUs: the fabrics are designed and implemented converged for maximum flexibility - this use-case being a good example of "why").

This is amazing! thanks!

To clarify: same exact container built from this PR starts with GPUs just fine in its default mode but in rpc ... it only presents CPU resources. The same container running as a client of the RPC only detects CPU resources available.
That said ... it does find multiple RPC server backend and their general-purpose memory 😄

Can you try running this?

ollama rpc --device list

it should print out all the devices its recognizing.

@asaiacai is spinning up and prepping the B200s; we'll run a single interface @ 400G for the socket test for the time being and can work out the multi-NIC thing and RDMA down the line, but I imagine everyone's curious about the implications of chaining hardware together even in such a user-friendly modality.

@gkpln3 if you have any specific models, configurations, etc. you'd like set up for the test, please feel free to ask @asaiacai as we go.

Once we have the code built there, we should also quickly be able to tell whether this is an artifact of the stack on which I am testing, though I do think that what I am seeing re "no GPUs" is a bug. :-)

@gkpln3 (Author) commented May 26, 2025

Got it, let me look into the GPU identification issue.

@sempervictus

Rebuilding :-)

@gkpln3 (Author) commented May 26, 2025

OK, I think I found the issue: it seems I wasn't initializing the ggml backend properly when starting the RPC server.
The fix I pushed might be a bit incomplete (Ollama has a more complex library-selection flow that I've yet to port over), but I think it should resolve the issue.

Let's see :)

@sempervictus commented May 26, 2025

No dice on 84aa6d0 unfortunately @gkpln3:

$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --device list --port 50053
2025/05/26 19:26:07 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:26:07 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:26:07 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:
^C2025/05/26 19:26:22 rpc_server.go:44: Shutting down RPC server...

^C
$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --port 50053
2025/05/26 19:31:53 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:31:53 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:31:53 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CPU backend

Any chance that Docker skipped the relevant build step?

 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                       0.0s
 => => transferring dockerfile: 5.87kB                                                                                                                                                                                                                                     0.0s
 => WARN: FromPlatformFlagConstDisallowed: FROM --platform flag should not use constant value "linux/arm64" (line 66)                                                                                                                                                      0.0s
 => WARN: FromPlatformFlagConstDisallowed: FROM --platform flag should not use constant value "linux/arm64" (line 77)                                                                                                                                                      0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                                                                                                                                                                            0.2s
 => [internal] load metadata for docker.io/rocm/dev-almalinux-8:6.3.3-complete                                                                                                                                                                                             0.2s
 => [internal] load .dockerignore                                                                                                                                                                                                                                          0.0s
 => => transferring context: 105B                                                                                                                                                                                                                                          0.0s
 => [stage-14 1/4] FROM docker.io/library/ubuntu:20.04@sha256:8feb4d8ca5354def3d8fce243717141ce31e2c428701f6682bd2fafe15388214                                                                                                                                             0.0s
 => [internal] load build context                                                                                                                                                                                                                                          0.0s
 => => transferring context: 62.48kB                                                                                                                                                                                                                                       0.0s
 => [base-amd64 1/2] FROM docker.io/rocm/dev-almalinux-8:6.3.3-complete@sha256:9b73d5d6c04f685b179fd6e24569f578de830a47f9ab69e99dd901f5e2aca184                                                                                                                            0.0s
 => CACHED [base-amd64 2/2] RUN yum install -y yum-utils     && yum-config-manager --add-repo https://dl.rockylinux.org/vault/rocky/8.5/AppStream/$basearch/os/     && rpm --import https://dl.rockylinux.org/pub/rocky/RPM-GPG-KEY-Rocky-8     && dnf install -y yum-uti  0.0s
 => CACHED [base 1/3] RUN curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.31.2/cmake-3.31.2-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1                                                                                          0.0s
 => CACHED [base 2/3] COPY CMakeLists.txt CMakePresets.json .                                                                                                                                                                                                              0.0s
 => CACHED [base 3/3] COPY ml/backend/ggml/ggml ml/backend/ggml/ggml                                                                                                                                                                                                       0.0s
 => CACHED [build 1/6] WORKDIR /go/src/github.com/ollama/ollama                                                                                                                                                                                                            0.0s
 => CACHED [build 2/6] COPY go.mod go.sum .                                                                                                                                                                                                                                0.0s
 => CACHED [build 3/6] RUN curl -fsSL https://golang.org/dl/go$(awk '/^go/ { print $2 }' go.mod).linux-$(case $(uname -m) in x86_64) echo amd64 ;; aarch64) echo arm64 ;; esac).tar.gz | tar xz -C /usr/local                                                              0.0s
 => CACHED [build 4/6] RUN go mod download                                                                                                                                                                                                                                 0.0s
 => [build 5/6] COPY . .                                                                                                                                                                                                                                                   0.2s
 => [build 6/6] RUN --mount=type=cache,target=/root/.cache/go-build     go build -trimpath -buildmode=pie -o /bin/ollama .                                                                                                                                                 2.5s
 => CACHED [cuda-11 1/2] RUN dnf install -y cuda-toolkit-11-3                                                                                                                                                                                                              0.0s
 => CACHED [cuda-11 2/2] RUN --mount=type=cache,target=/root/.ccache     cmake --preset 'CUDA 11'         && cmake --build --parallel --preset 'CUDA 11'         && cmake --install build --component CUDA --strip --parallel 8                                            0.0s
 => CACHED [amd64 1/2] COPY --from=cuda-11 dist/lib/ollama/cuda_v11 /lib/ollama/cuda_v11                                                                                                                                                                                   0.0s
 => CACHED [cuda-12 1/2] RUN dnf install -y cuda-toolkit-12-8                                                                                                                                                                                                              0.0s
 => CACHED [cuda-12 2/2] RUN --mount=type=cache,target=/root/.ccache     cmake --preset 'CUDA 12'         && cmake --build --parallel --preset 'CUDA 12'         && cmake --install build --component CUDA --strip --parallel 8                                            0.0s
 => CACHED [amd64 2/2] COPY --from=cuda-12 dist/lib/ollama/cuda_v12 /lib/ollama/cuda_v12                                                                                                                                                                                   0.0s
 => CACHED [cpu 1/2] RUN dnf install -y gcc-toolset-11-gcc gcc-toolset-11-gcc-c++                                                                                                                                                                                          0.0s
 => CACHED [cpu 2/2] RUN --mount=type=cache,target=/root/.ccache     cmake --preset 'CPU'         && cmake --build --parallel --preset 'CPU'         && cmake --install build --component CPU --strip --parallel 8                                                         0.0s
 => CACHED [archive 1/2] COPY --from=cpu dist/lib/ollama /lib/ollama                                                                                                                                                                                                       0.0s
 => [archive 2/2] COPY --from=build /bin/ollama /bin/ollama                                                                                                                                                                                                                0.2s
 => CACHED [stage-14 2/4] RUN apt-get update     && apt-get install -y ca-certificates     && apt-get clean     && rm -rf /var/lib/apt/lists/*                                                                                                                             0.0s
 => [stage-14 3/4] COPY --from=archive /bin /usr/bin                                                                                                                                                                                                                       0.2s
 => [stage-14 4/4] COPY --from=archive /lib/ollama /usr/lib/ollama                                                                                                                                                                                                        20.9s
 => exporting to image                                                                                                                                                                                                                                                    12.6s
 => => exporting layers                                                                                                                                                                                                                                                   12.5s
 => => writing image sha256:4c54c55a114a44c5eb75f7444280fa4c9e4e836d42733df084c709bff7379ddf                                                                                                                                                                               0.0s
 => => naming to docker.io/library/ollama:mnt 

@gkpln3 (Author) commented May 26, 2025

any chance that Docker skipped the relevant build step?

I don't think so; the relevant step is:

 => [build 6/6] RUN --mount=type=cache,target=/root/.cache/go-build     go build -trimpath -buildmode=pie -o /bin/ollama .                                                                                                                                                 2.5s

and it seems like it ran.

@sempervictus

Can confirm that passing serve to the same binary in the container picks up the 4 V100s, so somewhere in the initialization routine the two paths diverge... Oh GoLang, how I wish you were Rust ;-).

Will start digging into the code this evening if the bug remains elusive; and if the same issue is apparent on the B200s, then @asaiacai should have an independent build-and-test stack up on that hardware shortly, so we can validate whether this is confined to a specific hardware/compute-capability revision.

@asaiacai

Seeing similar errors on my B200 setup.

root@trainy:~/ollama# docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama rpc --device list --port 50053
2025/05/26 20:10:59 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 20:10:59 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 20:10:59 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:

@gkpln3 (Author) commented May 26, 2025

@sempervictus I'm pretty sure it's related to the libraries that Ollama passes to the runner process, which we're not setting in the ollama rpc command.
You can check the environment variables on the ollama runner process after you run a model to see which ones I'm missing in ollama rpc.

I can keep working on this later; I have to go for now.

@sempervictus

@gkpln3 - apologies for the lag, had to step out for a bit. In serve mode we get:

time=2025-05-26T23:26:09.202Z level=INFO source=routes.go:1206 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_RPC_SERVERS: OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-05-26T23:26:09.202Z level=INFO source=images.go:463 msg="total blobs: 0"
time=2025-05-26T23:26:09.202Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func4 (5 handlers)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-05-26T23:26:09.202Z level=INFO source=routes.go:1259 msg="Listening on [::]:11434 (version 0.0.0)"
time=2025-05-26T23:26:09.203Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-26T23:26:12.540Z level=INFO source=types.go:139 msg="inference compute" id=GPU-W library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-26T23:26:12.541Z level=INFO source=types.go:139 msg="inference compute" id=GPU-X library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-26T23:26:12.541Z level=INFO source=types.go:139 msg="inference compute" id=GPU-Y library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
time=2025-05-26T23:26:12.541Z level=INFO source=types.go:139 msg="inference compute" id=GPU-Z library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"

I tried feeding the "relevant-seeming" ones into the rpc call, to no avail:

 docker run --rm --name ollama_test --gpus=all -p 50053:50053 -e CUDA_VISIBLE_DEVICES=all -e OLLAMA_CONTEXT_LENGTH=4096 -e OLLAMA_HOST:http://0.0.0.0:11434 -e OLLAMA_INTEL_GPU=false ollama:mnt rpc --device list
2025/05/26 23:32:56 rpc_server.go:26: Starting RPC server on 0.0.0.0:50052
2025/05/26 23:32:56 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 23:32:56 rpc_server.go:37: RPC server started on 0.0.0.0:50052. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:

@gkpln3 (Author) commented May 27, 2025

@sempervictus I think the env vars you sent belong to the ollama serve process; ollama runner is a different process that Ollama spins up to handle a specific model run.
Try running a model, then run

ps -ef | grep "ollama runner"

take the PID of the process you find, and check the value of

cat /proc/{pid}/environ
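
If the output is hard to read: /proc/{pid}/environ is NUL-separated, so a quick tr (plain Linux tooling, nothing specific to this PR) prints one variable per line:

# print the runner's environment one variable per line
cat /proc/{pid}/environ | tr '\0' '\n'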

@sempervictus

Apologies, here's the subprocess output for a codegeex4 runner:

NVIDIA_VISIBLE_DEVICES=allPYTHON_SHA256=849da87af4df137710c1796e276a955f7a85c9f971081067c8f565d15c352a09HOSTNAME=8733e33b9343PYTHON_VERSION=3.11.12WHISPER_MODEL=baseOPENAI_API_KEY=ENV=prodANONYMIZED_TELEMETRY=falsePWD=/app/backendOPENAI_API_BASE_URL=PORT=8080WEBUI_SECRET_KEY=OTsiSmG8OAKX+vxyOLLAMA_BASE_URL=/ollamaHOME=/rootLANG=C.UTF-8USE_EMBEDDING_MODEL_DOCKER=sentence-transformers/all-MiniLM-L6-v2SENTENCE_TRANSFORMERS_HOME=/app/backend/data/cache/embedding/modelsUSE_CUDA_DOCKER=falseGPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696DUSE_CUDA_DOCKER_VER=cu128USE_RERANKING_MODEL_DOCKER=USE_OLLAMA_DOCKER=trueCUDA_VISIBLE_DEVICES=GPU-UUIDHF_HOME=/app/backend/data/cache/embedding/modelsSCARF_NO_ANALYTICS=trueDO_NOT_TRACK=trueRAG_RERANKING_MODEL=SHLVL=0DOCKER=trueOLLAMA_MODEL_PARALLEL=8WHISPER_MODEL_DIR=/app/backend/data/cache/whisper/modelsTIKTOKEN_ENCODING_NAME=cl100k_baseRAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binTIKTOKEN_CACHE_DIR=/app/backend/data/cache/tiktokenOLDPWD=/app/backendWEBUI_BUILD_VERSION=737dc7797ce5eb0f3dbe08bd1ab6778ea59f538e_=/usr/local/bin/ollamaOLLAMA_MAX_LOADED_MODELS=12OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama:/usr/local/lib/ollama

Interestingly, it sets CUDA_VISIBLE_DEVICES to the GPU UUID but NVIDIA_VISIBLE_DEVICES=all.

@gkpln3 (Author) commented May 27, 2025

Apologies, here's the subprocess output for a codegeex4 runner:

XXXXXXX

Interestingly it specifies CUDA_VISIBLE_DEVICES with the GPU UUID but NVIDIA_VISIBLE_DEVICES=all

Can you try running rpc with these env vars?

docker run --rm --name ollama_test --gpus=all -p 50052:50052 -e NVIDIA_VISIBLE_DEVICES=all -e OLLAMA_BASE_URL=/ollama -e USE_EMBEDDING_MODEL_DOCKER=sentence-transformers/all-MiniLM-L6-v2 -e SENTENCE_TRANSFORMERS_HOME=/app/backend/data/cache/embedding/models -e USE_CUDA_DOCKER=false -e USE_CUDA_DOCKER_VER=cu128 -e USE_RERANKING_MODEL_DOCKER= -e USE_OLLAMA_DOCKER=true -e CUDA_VISIBLE_DEVICES=GPU-UUID -e HF_HOME=/app/backend/data/cache/embedding/models -e OLLAMA_MODEL_PARALLEL=8 -e WHISPER_MODEL_DIR=/app/backend/data/cache/whisper/models -e TIKTOKEN_ENCODING_NAME=cl100k_base -e RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 -e TIKTOKEN_CACHE_DIR=/app/backend/data/cache/tiktoken -e OLLAMA_MAX_LOADED_MODELS=12O -e LLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12 -e LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama:/usr/local/lib/ollama ollama rpc --device list

Also, I suggest you remove the env vars from the comment; they contain some values that shouldn't be public 😅

@sempervictus

Appreciate the heads-up: those are default values from a compose file (from a GH repo somewhere, IIRC) which only apply if we don't supply overriding env vars when bringing the stack up. Ours are fed in by DevOps when used in prod, so those aren't in actual use :-).

On the forced CUDA v12 run attempt: same effect. I cleared all Ollama images, rebuilt the local one from this PR, and still get:

docker run --rm --name ollama_test --gpus=all -p 50053:50053 -e NVIDIA_VISIBLE_DEVICES=all -e OLLAMA_BASE_URL=/ollama -e USE_EMBEDDING_MODEL_DOCKER=sentence-transformers/all-MiniLM-L6-v2 -e SENTENCE_TRANSFORMERS_HOME=/app/backend/data/cache/embedding/models -e USE_CUDA_DOCKER=false -e USE_CUDA_DOCKER_VER=cu128 -e USE_RERANKING_MODEL_DOCKER= -e USE_OLLAMA_DOCKER=true -e CUDA_VISIBLE_DEVICES=GPU-UUID -e HF_HOME=/app/backend/data/cache/embedding/models -e OLLAMA_MODEL_PARALLEL=8 -e WHISPER_MODEL_DIR=/app/backend/data/cache/whisper/models -e TIKTOKEN_ENCODING_NAME=cl100k_base -e RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 -e TIKTOKEN_CACHE_DIR=/app/backend/data/cache/tiktoken -e OLLAMA_MAX_LOADED_MODELS=12O -e LLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12 -e LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama/cuda_v12:/usr/local/lib/ollama:/usr/local/lib/ollama  ollama:mnt rpc --device list
2025/05/27 18:10:21 config.go:210: WARN invalid environment variable, using default key=OLLAMA_MAX_LOADED_MODELS value=12O default=0
2025/05/27 18:10:21 rpc_server.go:26: Starting RPC server on 0.0.0.0:50052
2025/05/27 18:10:21 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/27 18:10:21 rpc_server.go:37: RPC server started on 0.0.0.0:50052. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:


@gkpln3 (Author) commented May 28, 2025

@sempervictus Thanks for the output; it seems this will require some extra investigation 😅
I'll update here when a solution is found.

@aquarat commented May 28, 2025

I found that if I clone llama.cpp directly, check out the commit in Makefile.sync, and then build the RPC server, it doesn't consistently work with this Ollama build, but at least it does sometimes work 🎉

I get errors like this:

Accepted client connection, free_mem=25031671808, total_mem=25307578368
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 49293.35 MiB on device 0: cudaMalloc failed: out of memory
[alloc_buffer] size: 51687819264 -> failed

The target device has 24 GB of VRAM.

It does seem to work when I use "qwen3:235b-a22b-q4_K_M". It's quite exciting when it does work 😄

@gkpln3 (Author) commented May 28, 2025

@aquarat Do you experience the same issue as @sempervictus where it doesn't recognize the GPUs?

@gkpln3 (Author) commented May 28, 2025

@sempervictus I just realized that we might have made a small mistake in our env var tests.
I've just had a look at Ollama's Dockerfile: the lib directory is at /usr/lib/ollama, not /usr/local/lib/ollama, while the env vars we tried were using /usr/local. I also made a small mistake with one of the env var names.

Let's give it another try with this:

docker run --rm --name ollama_test --gpus=all -p 50052:50052 -e OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12 -e LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama/cuda_v12:/usr/lib/ollama/cuda_v12:/usr/lib/ollama:/usr/lib/ollama ollama rpc --device list

@asaiacai

The GPUs get detected now after adjusting the env vars (see the ggml_cuda_init output below), but nothing appears under available devices.

root@trainy:~# docker run --rm --name ollama_test --gpus=all -p 50052:50052 -e OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12 -e LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama/cuda_v12:/usr/lib/ollama/cuda_v12:/usr/lib/ollama:/usr/lib/ollama ollama rpc --device list
2025/05/28 20:50:00 rpc_server.go:26: Starting RPC server on 0.0.0.0:50052
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 1: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 2: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 3: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 4: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 5: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 6: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 7: NVIDIA B200, compute capability 10.0, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
2025/05/28 20:50:07 ggml.go:105: INFO system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
2025/05/28 20:50:07 rpc_server.go:37: RPC server started on 0.0.0.0:50052. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

error: unknown device: list
available devices:

When I specify the device as CUDA0 it looks like the GPU backend is being created. How would I specify all GPUs here with the --device CLI argument?

root@trainy:~# docker run --rm --name ollama_test --gpus=all -p 50052:50052 -e OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12 -e LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama/cuda_v12:/usr/lib/ollama/cuda_v12:/usr/lib/ollama:/usr/lib/ollama ollama rpc --device CUDA0
2025/05/28 20:50:30 rpc_server.go:26: Starting RPC server on 0.0.0.0:50052
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 1: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 2: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 3: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 4: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 5: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 6: NVIDIA B200, compute capability 10.0, VMM: yes
  Device 7: NVIDIA B200, compute capability 10.0, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
2025/05/28 20:50:37 ggml.go:105: INFO system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
2025/05/28 20:50:37 rpc_server.go:37: RPC server started on 0.0.0.0:50052. Press Ctrl+C to exit.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CUDA0 backend

@gkpln3 (Author) commented May 29, 2025

@asaiacai Unfortunately, the RPC currently emulates a single GPU per instance, so multi-GPU support isn't available out of the box. You can run a separate RPC server for each GPU and connect them all, but it's not the most straightforward setup. Hopefully, we'll implement a better solution in the future.
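
For reference, a rough sketch of that manual workaround, reusing the image tag, ports, and host from earlier in this thread purely as placeholders (the OLLAMA_LIBRARY_PATH / LD_LIBRARY_PATH overrides from above may still be needed until the library-selection flow is ported over):

# one RPC server per GPU: pin each container to a single GPU via CUDA_VISIBLE_DEVICES,
# give it its own port, and select the (now only) visible device as CUDA0
docker run -d --name ollama_rpc0 --gpus=all -e CUDA_VISIBLE_DEVICES=0 -p 51020:51020 ollama:mnt rpc --device CUDA0 --port 51020
docker run -d --name ollama_rpc1 --gpus=all -e CUDA_VISIBLE_DEVICES=1 -p 51021:51021 ollama:mnt rpc --device CUDA0 --port 51021
# repeat for the remaining GPUs, then aggregate them on the front-end,
# e.g. OLLAMA_RPC_SERVERS=10.0.0.2:51020,10.0.0.2:51021,...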

@sempervictus

@asaiacai Unfortunately, the RPC currently emulates a single GPU per instance, so multi-GPU support isn't available out of the box. You can run a separate RPC server for each GPU and connect them all, but it's not the most straightforward setup. Hopefully, we'll implement a better solution in the future.

Since we're wrapping the RPC anyway, this is probably the layer at which that logic would make the most sense to implement. If doing just a discovery-and-init wrapper, probably something along the lines of enumerating GPUs, assigning each one a port incrementing from the base, and then printing those for consumers to pick up... I'm sure we can automate that with explicit device definitions at call time as well, but I figure it might be "slicker" if actually handled by the binary.

@sempervictus commented May 31, 2025

@gkpln3 Using individual RPC servers and feeding those to the ollama instance taking API requests seems to present the RPC services as generic targets, without the driver info we see on the local ones (there are 2x 4-way V100 SXM hosts):

ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=GPU-W library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.1 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=GPU-X library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=GPU-Y library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=GPU-Z library=cuda variant=v12 compute=7.0 driver=12.8 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=10.0.0.2:51020 library=rpc variant="" compute="" driver=0.0 name="" total="31.7 GiB" available="31.4 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=10.0.0.2:51021 library=rpc variant="" compute="" driver=0.0 name="" total="31.7 GiB" available="31.1 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=10.0.0.2:51022 library=rpc variant="" compute="" driver=0.0 name="" total="31.7 GiB" available="30.8 GiB"
ollama  | time=2025-05-31T21:03:53.877Z level=INFO source=types.go:139 msg="inference compute" id=10.0.0.2:51023 library=rpc variant="" compute="" driver=0.0 name="" total="31.7 GiB" available="30.5 GiB"

Distribution does occur, but between CPU and RPC rather than between local and remote GPUs. Should we have each card in the SXM hosts on its own RPC server, with a single front-end aggregating those? Is there a way to push more layers to the RPC targets?

load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 4 repeating layers to GPU
load_tensors: offloaded 4/65 layers to GPU
load_tensors: RPC[10.0.0.2:51020] model buffer size =  1019.02 MiB
load_tensors: RPC[10.0.0.2:51021] model buffer size =  1019.02 MiB
load_tensors: RPC[10.0.0.2:51022] model buffer size =  1019.02 MiB
load_tensors: RPC[10.0.0.2:51023] model buffer size =  1019.02 MiB
load_tensors:          CPU model buffer size = 59938.92 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 50000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     2.05 MiB
llama_kv_cache_unified: kv_size = 32768, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1, padding = 32
llama_kv_cache_unified: RPC[10.0.0.2:51020] KV buffer size =   128.00 MiB
llama_kv_cache_unified: RPC[10.0.0.2:51021] KV buffer size =   128.00 MiB
llama_kv_cache_unified: RPC[10.0.0.2:51022] KV buffer size =   128.00 MiB
llama_kv_cache_unified: RPC[10.0.0.2:51023] KV buffer size =   128.00 MiB
llama_kv_cache_unified:        CPU KV buffer size =  7680.00 MiB
llama_kv_cache_unified: KV self size  = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
llama_context: RPC[10.0.0.2:51020] compute buffer size =  6304.00 MiB
llama_context: RPC[10.0.0.2:51021] compute buffer size =  6304.00 MiB
llama_context: RPC[10.0.0.2:51022] compute buffer size =  6304.00 MiB
llama_context: RPC[10.0.0.2:51023] compute buffer size =  6280.00 MiB
llama_context:        CPU compute buffer size =  6368.01 MiB
llama_context: graph nodes  = 2024
llama_context: graph splits = 6

Separately - what's the correct compose invocation to run these one-off RPC instances?

    command:
      - rpc --device CUDA0 --port 51020

throws Error: unknown command "rpc --device CUDA0 --port 51020" for "ollama", whereas the same string passed to a docker run invocation appears to work fine. Peculiarity of the binary/called verb, or something off with compose?

@sempervictus

Seems the RPC services are somewhat unstable -

time=2025-05-31T22:33:45.785Z level=INFO source=routes.go:1259 msg="Listening on [::]:11434 (version 0.0.0)"
time=2025-05-31T22:33:45.785Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-31T22:33:51.325Z level=WARN source=gpu_rpc.go:177 msg="unable to connect to endpoint" endpoint=10.0.0.2:51020
time=2025-05-31T22:33:56.330Z level=WARN source=gpu_rpc.go:177 msg="unable to connect to endpoint" endpoint=10.0.0.2:51021
time=2025-05-31T22:34:01.334Z level=WARN source=gpu_rpc.go:177 msg="unable to connect to endpoint" endpoint=10.0.0.2:51022
time=2025-05-31T22:34:06.335Z level=WARN source=gpu_rpc.go:177 msg="unable to connect to endpoint" endpoint=10.0.0.2:51023
time=2025-05-31T22:34:06.335Z level=INFO source=types.go:139 msg="inference compute"

... local GPUs
I tried --net host just to avoid Docker nonsense, and nmap shows the ports as listening:

51020/tcp open  unknown
51021/tcp open  unknown
51022/tcp open  unknown
51023/tcp open  unknown

There are no error logs in the RPC server output or in the ollama instance connecting to them.

@ecyht2 commented Jun 7, 2025

no error logs in the RPC server output or ollama instance connecting to them.

Try enabling the debug logs by setting OLLAMA_DEBUG to 1.
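
For the RPC containers from earlier in this thread, that would look something like this (image tag, port, and library-path overrides are just the placeholders used above):

docker run --rm --gpus=all -e OLLAMA_DEBUG=1 -p 51020:51020 ollama:mnt rpc --device CUDA0 --port 51020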

@ecyht2 commented Jun 7, 2025
    command:
      - rpc --device CUDA0 --port 51020

throws Error: unknown command "rpc --device CUDA0 --port 51020" for "ollama" where as the same string being passed to a docker run invocation appears to work fine. Peculiarity of binary/called verb or something off w/ compose?

Seems like docker-compose is passing the whole thing to the binary as a single argument. Try splitting it up like this:

    command:
      - rpc
      - --device
      - CUDA0
      - --port
      - 51020
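
A plain YAML string (rather than a single-element list) should also work, since compose splits string commands with shell-style word splitting:

    command: rpc --device CUDA0 --port 51020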
