discover/gpu.go: Add Support for Distributed Inferencing #6729
Conversation
Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using a value that is fixed when the instance is configured. Alternatively, we could provide flexibility by taking the value from the request itself. This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.
As far as I remember, the RPC feature of llama.cpp only works with a static set, so you need to pass the RPC endpoints you want to control beforehand. On Kubernetes that would only work with StatefulSets, as they have predictable names. Another way would be to run a controller container that fetches all endpoints from the ollama service and creates the list for you. But be aware that if you don't use StatefulSets, the IP will very likely change on every restart of the pod.
Maybe not related, but we have an example for llama.cpp distributed serving in Kubernetes with LWS; take a look if you're interested: https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/llamacpp/README.md
As explained in the comment by @hingstarne, llama.cpp only works with a static set of RPC servers. In theory this might work, because the llama.cpp backend server only starts when a request is sent (as far as I know), but it is kept alive for a certain amount of time afterwards, and the set of RPC servers cannot change while it is running. EDIT: A better explanation is in my later comment.
@ecyht2 hi!
You are correct. However, in ollama the set of RPC servers is currently fixed before the instance starts serving requests. My proposal is to allow users to pass an rpc attribute in the request instead.
Kubernetes was just an example; this approach can be applied in various setups, such as multiple VMs, an OpenStack deployment, or a bunch of Raspberry Pi CM3 boards pre-configured using tools like Ansible, and so on. Let me provide a small example to illustrate the concept. Suppose we have an ollama instance pre-configured to work with a limited set of RPC servers. Currently, we can make an ordinary /api/generate request (as in the curl examples further down, but without any rpc attribute).
In this scenario, the server will communicate with the RPC backends that were defined when the instance was configured. Now imagine if the client could pass an rpc attribute directly in the request, as in the curl examples shown later in this thread.
With this approach, the user gains the flexibility to specify different RPC servers dynamically, without the need to restart the ollama instance.
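To make the proposed payload concrete, here is a minimal Go sketch of what such a per-request field could look like. The struct and the Rpc field name are illustrative assumptions for this discussion only, not ollama's actual api types and not the code in this PR.

    // Illustrative only: a request shape mirroring the JSON used in this thread.
    package api

    type GenerateRequest struct {
        Model     string                 `json:"model"`
        Prompt    string                 `json:"prompt"`
        KeepAlive int                    `json:"keep_alive,omitempty"`
        Options   map[string]interface{} `json:"options,omitempty"`
        // Rpc is the hypothetical per-request field discussed above: a
        // comma-separated list of llama.cpp RPC backends, e.g.
        // "backend01:50052,backend02:50052".
        Rpc string `json:"rpc,omitempty"`
    }

On the server side, that value would have to reach the llama.cpp runner before it starts, which is exactly the constraint discussed in the comments below.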
The reason why I didn't go with your suggestion is the keep_alive behaviour. Let's say a request like this is sent:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend01:50052,backend02:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Then another request is sent within the timeout period:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend03:50052,backend04:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Because the model is already started, the second request would still be using backend01 and backend02. This is the main reason why I chose the option of setting the RPC servers when the ollama instance is started.
Okay, agreed.
ollama runs needsReload before each request. It includes a check for changes in parameters to the llama server, so if the rpc backends change, it should cause a model reload.
Ah OK, I will add it when I am free. |
Would this just be checking if the options change inside this method? Could you just grab the rpc backends from the request options and compare them there?
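For readers following the thread, here is a minimal sketch of the kind of check being discussed: comparing the RPC backends the running llama server was started with against the ones named in an incoming request. The package, function, and parameters are illustrative assumptions, not ollama's actual scheduler code.

    // Illustrative only: a change in the requested RPC backends is treated like
    // any other llama-server option change and forces the runner to restart.
    package sched

    import "strings"

    // rpcChanged reports whether the requested RPC backends differ from the
    // ones the currently loaded runner was started with.
    func rpcChanged(loadedRPC, requestedRPC string) bool {
        normalize := func(s string) string {
            parts := strings.Split(s, ",")
            for i := range parts {
                parts[i] = strings.TrimSpace(parts[i])
            }
            return strings.Join(parts, ",")
        }
        return normalize(loadedRPC) != normalize(requestedRPC)
    }

Whether such a comparison lives in needsReload itself or in the option comparison it already performs is exactly the open question in the comment above.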
Following up: are we still planning on implementing this functionality?
Yup, sorry, been a bit busy and unable to write code recently.
Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) to begin replacing that code with a new Go based equivalent. The goal of this is to add more unit testing and fix some long standing stability bugs while preserving vision model support. If you need help rebasing your PR don't hesitate to contact us by replying here.
Thanks for the notice. I am not sure what needs to be done. This PR requires the RPC backend from GGML; how should I go about adding it? From my understanding, something like this should be added: go build -tags=rpc
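For context on the go build -tags=rpc idea: a Go build tag lets a file be compiled only when that tag is supplied on the command line. A minimal sketch of what a tag-gated file looks like follows; the file contents, package name, and function are illustrative assumptions, not the PR's actual layout.

    // Compiled only when building with: go build -tags=rpc

    //go:build rpc

    package discover

    // rpcSupported reports whether this binary was built with the llama.cpp
    // RPC backend enabled.
    func rpcSupported() bool { return true }

A companion file carrying //go:build !rpc would provide the false variant so the rest of the code base still builds without the tag.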
@aquarat I was able to get your aquarat#3 branch built on Mac, Windows, and Linux, and was able to connect to the Mac and Linux llama.cpp RPC servers using ggml-org/llama.cpp@89b2b56 from Windows, but trying to run deepseek-r1:70b didn't seem to actually offload anything to the RPC servers. It seemed to prefer to swap from disk rather than make RPC calls. Also, I'm not sure what happened, but your comment on this PR seems to have disappeared.
Hey! The first issue was related to the way llama.cpp was built. The other change was related to the "gpu" package being renamed to "discover" (05cd82e). One of the things that occurred to me is that you need a separate RPC server executable that must be built from source, and it would be nice to incorporate it into ollama so it's one image/executable. I deleted my comment here after I realised the merge conflict was more complex than I originally expected 😆 Have you tried any other models? It seems like Ollama is happy to spread some models across GPUs, but not all?
Hi @aquarat, I have a bunch of 16 GB M1 machines that I would like to stack to be able to run large models, and then compare the speed with a Mac Studio with 192 GB of RAM. If you guide me, I can try it and share the results with the community.
Sorry for the wait, it should work now. |
@dhiltgen @jmorganca This PR is ready for review. |
Very exciting :)
I just figured I'd add some comments on language/docs to help improve the likelihood of this getting approved.
This doesn't build for me with docker.
I tried several models.
This seems to fix the docker build. I'm however not very familiar with Go, so perhaps this is not the appropriate way to fix it.
I think putting it in the file is better. It might have worked when compiling in non-Docker environments because newer C++ compilers default to C++17, which is required for std::filesystem (see https://en.cppreference.com/w/cpp/filesystem).
I've made some fixes and changes to make this work on macOS: https://github.com/gkpln3/ollama/tree/feat/rpc
Hey, I'm not a maintainer of this project, but I'd say go for it and open a PR. I'm keen to give your branch a go 😊
I can help with testing. I have a bunch of Mac mini M1 machines to try that on.
@igorschlum I'd love that. Can you please test this PR? #10844
Closing in favor of #10844. |
This feature adds support for llama.cpp RPC, which allows for distributed inferencing across different devices.
This Pull Request aims to implement #4643.
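The thread above never shows how the PR actually exposes the list of RPC servers, so purely as an illustrative sketch, this is how a comma-separated list of llama.cpp RPC endpoints could be read from the environment and split before being handed to the runner. The OLLAMA_RPC_SERVERS variable name and the code layout are assumptions, not the PR's implementation.

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // rpcServers reads a comma-separated endpoint list such as
    // "backend01:50052,backend02:50052" and returns the individual endpoints.
    func rpcServers() []string {
        raw := os.Getenv("OLLAMA_RPC_SERVERS") // hypothetical variable name
        var servers []string
        for _, s := range strings.Split(raw, ",") {
            if s = strings.TrimSpace(s); s != "" {
                servers = append(servers, s)
            }
        }
        return servers
    }

    func main() {
        fmt.Println(rpcServers())
    }

Each endpoint would correspond to a llama.cpp rpc-server instance listening on that host and port.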