discover/gpu.go: Add Support for Distributed Inferencing (continued) #10844
Conversation
To clarify: the exact same container built from this PR starts up with GPUs just fine in its default mode, but not when run in rpc mode. That said ... it does find multiple RPC server backends and their general-purpose memory 😄
@gkpln3 (and probably @rick-github + team, given the potential visibility of this track) - the team at WhiteFiber have graciously offered a couple of 8-way B200 hosts and possibly some H200s on various fabrics to help test once we get the basics going (the people who run the place built what was probably the first GPUaaS cloud; they're very FOSS-friendly), so we may have a chance to see just how far this rocket ship goes in practical terms. I know that RPC isn't exactly as efficient as zero-copy RDMA, but 4 Tbit of interconnect is nothing at which to scoff (all 400G cards, 2x for the front-ends, 8 paired w/ GPUs; the fabrics are designed and implemented converged for maximum flexibility - this use-case being a good example of "why").
This is amazing! Thanks!
Can you try running this?
It should print out all the devices it's recognizing.
It is not happy doing that :-\
with no further output. In case it's relevant, the backing devices are 4x 32G V100s on SXM2, so v7 compute capability, and they obviously work on just
@asaiacai is spinning up and prepping the B200s; we'll run a single interface @ 400G for the socket test for the time being and can work out the multi-NIC thing and RDMA down the line, but I imagine everyone's curious about the implications of chaining hardware together even in such a user-friendly modality. @gkpln3 if you have any specific models, configurations, etc. you'd like set up for the test, please feel free to ask @asaiacai as we go. Once we have the code built there we should also quickly be able to tell if this is an artifact of the stack on which I am testing, though I do think that what I am seeing re "no GPUs" is a bug. :-)
Got it, let me check about the GPU identification thing.
Rebuilding :-)
Ok, I think I found the issue. Seems like I wasn't initializing the ggml backend properly when loading the RPC server. Let's see :)
No dice on 84aa6d0 unfortunately, @gkpln3:
$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --device list --port 50053
2025/05/26 19:26:07 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:26:07 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:26:07 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
error: unknown device: list
available devices:
^C2025/05/26 19:26:22 rpc_server.go:44: Shutting down RPC server...
^C
$ docker run --rm --name ollama_test --gpus=all -p 50053:50053 ollama:mnt rpc --port 50053
2025/05/26 19:31:53 rpc_server.go:26: Starting RPC server on 0.0.0.0:50053
2025/05/26 19:31:53 ggml.go:105: INFO system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
2025/05/26 19:31:53 rpc_server.go:37: RPC server started on 0.0.0.0:50053. Press Ctrl+C to exit.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
create_backend: using CPU backend
Any chance that Docker skipped the relevant build step?
I think not, the relevant one is:
and it seems like it ran it.
Can confirm that passing
Will start digging into the code this evening if the bug remains elusive; and if the same issue is apparent on the B200s, then @asaiacai should have an independent build & test stack up on that hardware shortly, so we can validate whether this is confined to a specific HW/compatibility rev.
Seeing similar errors on my
@sempervictus I'm pretty sure it's related to the libraries that ollama passes to the runner process that we're not getting in the
I can keep working on that later; I have to go for now.
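A minimal sketch of the hypothesis above, assuming the missing piece is the library-path environment that the runner subprocess normally receives: spawn the RPC server with that environment forwarded so the CUDA ggml backend can be loaded instead of falling back to CPU. The library directory and flag values here are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Assumption: the bundled CUDA/ggml backend libraries live in a directory
	// like this one; in practice it would be whatever ollama passes to its
	// runner subprocesses.
	libDir := "/usr/lib/ollama/cuda_v12"

	// Launch the RPC server with the runner-style environment forwarded.
	cmd := exec.Command("ollama", "rpc", "--port", "50053")
	cmd.Env = append(os.Environ(), "LD_LIBRARY_PATH="+libDir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "rpc server exited:", err)
	}
}
```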
@gkpln3 - apologies for the lag, had to step out for a bit. In
I tried feeding the "relevant-seeming" ones into the
@sempervictus I think the env vars you sent belong to the
Take the
Apologies, here's the subprocess output for a codegeex4 runner:
Interestingly, it specifies
Can you try running rpc with these env vars?
Also, I suggest you remove the env vars from the comment; it contains some additional values that shouldn't be public 😅
Appreciate the heads up - those are default values in a compose file (from a GH repo somewhere, IIRC) which only exist if we don't supply overriding env vars when bringing the stack up. Ours are fed by devops when used in prod, so those aren't in actual use :-). On the v12 forced-run attempt - same effect. Cleared all ollama images, rebuilt the local one from this PR, and still get:
@sempervictus Thanks for the output; seems like it will require some extra investigation 😅
I found that if I clone llama.cpp directly, check out the commit in Makefile.sync, and then build the RPC server, it doesn't seem to consistently work with this Ollama - but at least it does sometimes work 🎉 I get errors like this:
The target device has 24 GB of VRAM. It does seem to work when I use "qwen3:235b-a22b-q4_K_M". It's quite exciting when it does work 😄
@aquarat Do you experience the same issue as @sempervictus where it doesn't recognize the GPUs?
@sempervictus I just realized that we might have made a small mistake with our env vars tests. Let's give it another try with this:
The GPUs get detected now after adjusting the env vars under
When I specify the device as
@asaiacai Unfortunately, the RPC currently emulates a single GPU per instance, so multi-GPU support isn't available out of the box. You can run a separate RPC server for each GPU and connect them all, but it's not the most straightforward setup. Hopefully, we'll implement a better solution in the future.
Since we're wrapping the RPC anyway, this is probably the layer at which that logic would make the most sense to implement. If doing just a discovery-and-init wrapper, probably something along the lines of enumerating GPUs, assigning each one a port incrementing from the base, and then printing that for consumers to pick up... I'm sure we can automate that by explicit device definition at call time as well, but figure it might be "slicker" if actually handled by the bin.
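As a rough illustration of that wrapper idea (assumptions: GPUs are counted via nvidia-smi -L, devices are named CUDA0..CUDAn-1, and the per-device flags are the --device/--port ones used later in this thread; this is not code from the PR):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	const basePort = 50053 // starting port; each GPU gets base+index

	// Count local GPUs: nvidia-smi -L prints one "GPU N: ..." line per device.
	out, err := exec.Command("nvidia-smi", "-L").Output()
	if err != nil {
		fmt.Println("could not enumerate GPUs:", err)
		return
	}
	var gpus int
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.HasPrefix(line, "GPU ") {
			gpus++
		}
	}

	// Start one RPC server per GPU on an incrementing port.
	var endpoints []string
	for i := 0; i < gpus; i++ {
		port := basePort + i
		cmd := exec.Command("ollama", "rpc",
			"--device", fmt.Sprintf("CUDA%d", i),
			"--port", strconv.Itoa(port))
		if err := cmd.Start(); err != nil {
			fmt.Printf("failed to start RPC server for CUDA%d: %v\n", i, err)
			continue
		}
		endpoints = append(endpoints, "0.0.0.0:"+strconv.Itoa(port))
	}

	// Print the endpoint list for consumers (e.g. an aggregating front-end).
	fmt.Println(strings.Join(endpoints, ","))
}
```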
@gkpln3 Using individual RPC servers and feeding those to the
distribution does occur, but between CPU and RPC rather than between local and remote GPUs. Should we have each card in the SXM hosts on its own RPC server, with a single front-end aggregating those? Is there a way to push more layers to the RPC targets?
Separately - what's the correct compose invocation to run these one-off RPC instances?
throws
Seems the RPC services are somewhat unstable -
... local GPUs
No error logs in the RPC server output or
Try enabling the debug logs by setting
Seems like docker-compose is passing everything as a single string. Try splitting it up like:
command:
  - rpc
  - --device
  - CUDA0
  - --port
  - "51020"
This PR builds on top of the work done by @ecyht2 on #6729, following issue #4643.
It aims to add RPC support to Ollama based on llama.cpp's RPC mechanism, to allow distributed inference across multiple devices.
This PR has been tested and confirmed working on macOS (fixing a race condition in distributed inference). Best performance can be achieved by connecting the devices using Thunderbolt 4.
This PR also adds the `ollama rpc` command, which allows running the RPC server on another computer.
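For illustration only, a minimal sketch of how such a subcommand might be wired up with cobra (which the Ollama CLI uses). The --port and --device flags mirror the ones exercised earlier in this thread; everything else, including the default values and the server startup itself, is a placeholder rather than the PR's actual implementation.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var (
		port   int
		device string
	)

	rpcCmd := &cobra.Command{
		Use:   "rpc",
		Short: "Run a ggml RPC backend server for distributed inference",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Placeholder: initialize the ggml backend for the requested
			// device and serve it over the llama.cpp RPC protocol.
			fmt.Printf("starting RPC server for %q on 0.0.0.0:%d\n", device, port)
			return nil
		},
	}
	rpcCmd.Flags().IntVar(&port, "port", 50052, "port to listen on (placeholder default)")
	rpcCmd.Flags().StringVar(&device, "device", "CUDA0", "backend device to expose (placeholder default)")

	if err := rpcCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```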