discover/gpu.go: Add Support for Distributed Inferencing by ecyht2 · Pull Request #6729 · ollama/ollama · GitHub

discover/gpu.go: Add Support for Distributed Inferencing #6729


Closed
wants to merge 16 commits

Conversation

@ecyht2 ecyht2 commented Sep 10, 2024

This feature adds support for llama.cpp RPC. This allows for distributed inferencing on different devices.

This Pull Request aims to implement #4643.
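
As a rough sketch of the general idea (the helper below is illustrative rather than the PR's exact code), the discovery layer can parse a comma-separated OLLAMA_RPC_SERVERS environment variable into a list of host:port endpoints that the scheduler can offload layers to:

package discover

import (
	"os"
	"strings"
)

// rpcServersFromEnv is a hypothetical helper sketching the general approach:
// parse OLLAMA_RPC_SERVERS (e.g. "host1:50052,host2:50052") into a slice of
// endpoints that the scheduler could treat as extra backends to offload to.
func rpcServersFromEnv() []string {
	raw := os.Getenv("OLLAMA_RPC_SERVERS")
	if raw == "" {
		return nil
	}
	var servers []string
	for _, s := range strings.Split(raw, ",") {
		if s = strings.TrimSpace(s); s != "" {
			servers = append(servers, s)
		}
	}
	return servers
}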

@ecyht2 ecyht2 marked this pull request as draft September 13, 2024 01:06
@ecyht2 ecyht2 marked this pull request as ready for review September 17, 2024 06:33
@EvilFreelancer
EvilFreelancer commented Sep 17, 2024

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.
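
A minimal sketch of the precedence being proposed here, using a hypothetical resolveRPCServers helper (the per-request list wins, then the environment variable, then no RPC backends):

package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveRPCServers illustrates the suggested fallback order; it is not an
// existing ollama function. requestRPC would come from the HTTP payload.
func resolveRPCServers(requestRPC string) []string {
	if requestRPC != "" {
		return strings.Split(requestRPC, ",") // per-request list takes priority
	}
	if env := os.Getenv("OLLAMA_RPC_SERVERS"); env != "" {
		return strings.Split(env, ",") // otherwise the server-wide default
	}
	return nil // otherwise no RPC backends: local inference only
}

func main() {
	fmt.Println(resolveRPCServers(""))                                // uses OLLAMA_RPC_SERVERS, if set
	fmt.Println(resolveRPCServers("backend01:50052,backend02:50052")) // ignores the env var
}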

@hingstarne

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.

As far as I remember, the RPC feature of llama.cpp works only with a static set, so you need to pass the RPC endpoints to control beforehand. That would work only with StatefulSets on Kubernetes, as they have predictable names. Another way would be running a controller container that fetches all endpoints from the ollama service and creates the list for you. But be aware that if you don't use StatefulSets, the IP will very likely change on every restart of the pod.
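
To make the StatefulSet point concrete, here is a hedged sketch of generating the static endpoint list from the predictable pod DNS names a headless service provides; all names and the port are hypothetical:

package main

import "fmt"

// rpcEndpoints sketches how a static RPC endpoint list can be generated from
// predictable StatefulSet pod DNS names
// (<set>-<ordinal>.<service>.<namespace>.svc.cluster.local).
// The set/service/namespace names and the port below are hypothetical.
func rpcEndpoints(set, svc, namespace string, replicas, port int) []string {
	endpoints := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		endpoints = append(endpoints, fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local:%d", set, i, svc, namespace, port))
	}
	return endpoints
}

func main() {
	// e.g. llama-rpc-0.llama-rpc.default.svc.cluster.local:50052, ...
	fmt.Println(rpcEndpoints("llama-rpc", "llama-rpc", "default", 3, 50052))
}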

@kerthcet
kerthcet commented Sep 18, 2024

Maybe not related, but we have an example for llama.cpp distributed serving in Kubernetes with LWS; take it away if you're interested: https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/llamacpp/README.md

@ecyht2
Author
ecyht2 commented Sep 18, 2024

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.

As explained in @hingstarne's comment, llama.cpp only works with a static set of RPC servers. In theory this might work, because the llama.cpp backend server only starts when a request is sent (as far as I know), but it is kept alive for a certain amount of time. However, the OLLAMA_RPC_SERVERS option will not take effect for future requests if the current backend server is still up.

EDIT: Better explanation.

@EvilFreelancer
EvilFreelancer commented Sep 18, 2024

@ecyht2 hi!

As far as I remember, the RPC feature of llama.cpp works only with a static set, so you need to pass the RPC endpoints to control beforehand.

You are correct. However, in ollama, the llama-server is started dynamically with each user request for processing the required model (if not already started). ollama also has a keep_alive timeout for unloading inactive instances from memory. So practically, users already have control over llama-server instances.

My proposal is to allow users to pass an rpc option in the request payload (similar to model or keep_alive), enabling the ability to launch different instances of llama-server, each connected to separate sets of RPC backends.

That would work only with StatefulSets on Kubernetes, as they have predictable names.

Kubernetes was just an example; this approach can be applied in various setups, such as multiple VMs, an OpenStack deployment, or a bunch of Raspberry Pi CM3 boards pre-configured using tools like Ansible.


Let me provide a small example to illustrate the concept.

Suppose we have an ollama instance pre-configured to work with a limited set of RPC servers. Currently, we can make a request like this:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

In this scenario, the server will communicate with the RPC backends defined in OLLAMA_RPC_SERVERS. However, if a client wants to query a different set of RPC servers, the current approach would require modifying the ollama settings, restarting the service, and then making the query again.

Now imagine if the client could pass an rpc attribute directly in the request:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend01:50052,backend01:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

With this approach, the user gains the flexibility to specify different RPC servers dynamically, without the need to restart the ollama instance.
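
In code terms, the proposal roughly amounts to one extra field on the generate request, sketched below as a simplified, hypothetical struct (the rpc field does not exist in ollama's current API, and the other field types are simplified for illustration):

package api

// GenerateRequest is a simplified, hypothetical shape used only to illustrate
// the proposal; the rpc field does not exist in ollama today, and keep_alive
// and options are shown with simplified types.
type GenerateRequest struct {
	Model     string         `json:"model"`
	Prompt    string         `json:"prompt"`
	KeepAlive int            `json:"keep_alive,omitempty"`
	Options   map[string]any `json:"options,omitempty"`
	RPC       string         `json:"rpc,omitempty"` // comma-separated host:port list of RPC backends
}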

@EvilFreelancer

As explained in @hingstarne's comment, llama.cpp only works with a static set of RPC servers.

For llama.cpp, you're correct — it works with a static set of RPC servers.
However, with ollama, it's different. It can dynamically start multiple instances of the llama.cpp server as needed.

@ecyht2
Author
ecyht2 commented Sep 19, 2024

dynamically with each user request for processing the required model (if not already started).

The reason why I don't go with your suggestion is the "required model (if not already started)" part, with emphasis on "(if not already started)".

Let's say a request like this is sent:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend01:50052,backend02:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Then another request is sent within the timeout period.

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend03:50052,backend04:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Because the model is already started, the second request would use backend01:50052,backend02:50052 as the backend instead of the expected backend03:50052,backend04:50052. This is because llama-server was started with backend01:50052,backend02:50052 and the model parameters are already loaded. As far as I know, the model scheduler only checks the model name to see whether a model is loaded or not.

This is the main reason why I chose the option of setting the RPC servers when doing ollama serve. This is my understanding of what would happen. That said, you can select a GPU from the request, but I believe it might cause the same unexpected behavior.
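
A simplified sketch of the failure mode described above (not ollama's actual scheduler code): if runner reuse is keyed only on the model name, the second request silently inherits whatever RPC backends the first runner was started with.

package main

import "fmt"

// runner is a stand-in for a loaded llama-server instance; its RPC backends
// are fixed at start time and cannot change afterwards.
type runner struct {
	model      string
	rpcServers string
}

// pickRunner sketches reuse keyed only on the model name: if the model is
// already loaded, the newly requested RPC list is ignored.
func pickRunner(loaded map[string]*runner, model, rpc string) *runner {
	if r, ok := loaded[model]; ok {
		return r // reused as-is, even if r.rpcServers != rpc
	}
	r := &runner{model: model, rpcServers: rpc}
	loaded[model] = r
	return r
}

func main() {
	loaded := map[string]*runner{}
	pickRunner(loaded, "llama3.1:8b", "backend01:50052,backend02:50052")
	r := pickRunner(loaded, "llama3.1:8b", "backend03:50052,backend04:50052")
	fmt.Println(r.rpcServers) // still backend01:50052,backend02:50052
}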

@EvilFreelancer

Okay, agree.

@rick-github
Collaborator

ollama runs needsReload before each request. It includes a check for changes in parameters to the llama server, so if the rpc backends change, it should cause a model reload.

@ecyht2
Author
ecyht2 commented Sep 25, 2024

ollama runs needsReload before each request. It includes a check for changes in parameters to the llama server, so if the rpc backends change, it should cause a model reload.

Ah OK, I will add it when I am free.

@hidden1nin
Contributor
hidden1nin commented Sep 26, 2024

Would this just be checking if the options change inside this method?
func (runner *runnerRef) needsReload(ctx context.Context, req *LlmRequest) bool {
Doesn't !reflect.DeepEqual(optsExisting, optsNew) || // have the runner options changed? cover this already?

Could you just grab the result of func getAvailableServers() map[string]string { when the runnerRef is made and store it, then compare it to a second call to getAvailableServers() during needsReload?
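
A sketch of the check being discussed, not the actual needsReload implementation: store the RPC server list the runner was started with on the runnerRef and compare it against the newly requested list, mirroring the existing reflect.DeepEqual comparison of options.

package server

import "reflect"

// rpcServersChanged is a hypothetical helper: startedWith is the list stored
// on the runnerRef when its llama-server was launched, requested is the list
// from the incoming request. Any difference should force a model reload.
func rpcServersChanged(startedWith, requested []string) bool {
	return !reflect.DeepEqual(startedWith, requested)
}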

@hidden1nin
Contributor

Following up, are we still planning on implementing this functionality?

@ecyht2
Author
ecyht2 commented Oct 4, 2024

Following up, are we still planning on implementing this functionality?

Yup, sorry, been a bit busy and unable to write code recently.

@dhiltgen
Collaborator
dhiltgen commented Oct 7, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) to begin replacing that code with a new Go based equivalent. The goal of this is to add more unit testing and fix some long standing stability bugs while preserving vision model support. If you need help rebasing your PR don't hesitate to contact us by replying here.

@ecyht2 ecyht2 marked this pull request as ready for review October 13, 2024 00:36
@ecyht2
Author
ecyht2 commented Oct 13, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) to begin replacing that code with a new Go based equivalent. The goal of this is to add more unit testing and fix some long standing stability bugs while preserving vision model support. If you need help rebasing your PR don't hesitate to contact us by replying here.

Thanks for the notice. I am not sure what needs to be done. This PR requires the RPC backend from GGML; how should I go about adding it?

From my understanding, something like this should be added:

go build -tags=rpc .
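
For reference, a Go build tag would keep the RPC backend opt-in. A minimal sketch with a hypothetical file and constant, compiled only when building with go build -tags=rpc .:

//go:build rpc

// This is a hypothetical rpc_enabled.go: it is only compiled when ollama is
// built with `go build -tags=rpc .`. A counterpart file guarded by
// //go:build !rpc would define rpcEnabled = false for default builds.
package discover

const rpcEnabled = true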

@ecyht2 ecyht2 marked this pull request as draft October 13, 2024 02:37
@dsluo
dsluo commented Mar 10, 2025

@aquarat I was able to get your aquarat#3 branch built on Mac, Windows, and Linux, and was able to connect to the Mac and Linux llama.cpp RPC servers using ggml-org/llama.cpp@89b2b56 from Windows, but actually trying to run deepseek-r1:70b didn't seem to offload anything to the RPC servers. It seemed to prefer to swap from disk rather than make RPC calls.

Also I'm not sure what happened, but your comment on this PR seems to have disappeared.

@aquarat
aquarat commented Mar 10, 2025

@aquarat I was able to get your aquarat#3 branch built on Mac, Windows, and Linux, and was able to connect to the Mac and Linux llama.cpp RPC servers using ggml-org/llama.cpp@89b2b56 from Windows, but actually trying to run deepseek-r1:70b didn't seem to offload anything to the RPC servers. It seemed to prefer to swap from disk rather than make RPC calls.

Also I'm not sure what happened, but your comment on this PR seems to have disappeared.

Hey
Wow, cool! I must admit I haven't had a chance to actually test anything yet, job and all. The merge conflict was caused by two large changes in this repo since October 2024.

The first was related to the way llama.cpp was built.

The other change was related to the "gpu" package being renamed to "discover" (05cd82e)

One of the things that occurred to me is that you need a separate RPC server executable that must be built from source; it would be nice to incorporate it into ollama so it's all one image/executable.

And I deleted my comment here after I realised the merge conflict was more complex than I originally expected 😆

Have you tried any other models? It seems like Ollama is happy to spread some models across GPUs but not all?

@igorschlum

Hi @aquarat, I have a bunch of 16 GB M1 Macs that I would like to stack to be able to run large models, and then compare the speed with a Mac Studio with 192 GB of RAM. If you guide me, I can try it and share results with the community.

@ecyht2 ecyht2 marked this pull request as ready for review May 4, 2025 07:42
@ecyht2
Author
ecyht2 commented May 4, 2025

Sorry for the wait, it should work now.

@ecyht2 ecyht2 changed the title Feature: Add Support for Distributed Inferencing discover/gpu.go: Add Support for Distributed Inferencing May 4, 2025
@ecyht2
Author
ecyht2 commented May 6, 2025

@dhiltgen @jmorganca This PR is ready for review.

@aquarat aquarat left a comment


Very exciting :)
I just figured I'd add some comments on language/docs to help improve the likelihood of this getting approved.

@aquarat
aquarat commented May 9, 2025

This doesn't build for me with Docker (e.g. docker build .). It does build directly, though. If I run the directly built executable, regardless of whether I use RPC servers or not, I get:

time=2025-05-09T15:49:08.418Z level=ERROR source=server.go:466 msg="llama runner terminated" error="exit status 2"
time=2025-05-09T15:49:08.460Z level=ERROR source=sched.go:476 msg="error loading llama server" error="llama runner process has terminated: error loading model: failed to find a compatible buffer type for tensor blk.0.attn_norm.weight\nllama_model_load_from_file_impl: failed to load model"

I tried several models.

 > [build 6/6] RUN --mount=type=cache,target=/root/.cache/go-build     go build -trimpath -buildmode=pie -o /bin/ollama .:
2.264 # github.com/ollama/ollama/ml/backend/ggml/ggml/src/ggml-rpc
2.264 ggml-rpc.cpp:33:21: error: 'filesystem' is not a namespace-name
2.264    33 | namespace fs = std::filesystem;
2.264       |                     ^~~~~~~~~~
(...)
Dockerfile:97
--------------------
  96 |     ENV CGO_ENABLED=1
  97 | >>> RUN --mount=type=cache,target=/root/.cache/go-build \
  98 | >>>     go build -trimpath -buildmode=pie -o /bin/ollama .
  99 |

@eras
eras commented May 10, 2025

This doesn't build for me with Docker (e.g. docker build .).

This seems to fix the docker build:

diff --git a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
index 8a61cb93..1c4d711f 100644
--- a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
+++ b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
@@ -2,4 +2,5 @@ package rpc
 
 // #cgo CPPFLAGS: -I${SRCDIR}/../../include
 // #cgo CPPFLAGS: -I${SRCDIR}/../
+// #cgo CXXFLAGS: -std=c++17
 import "C"

I'm not very familiar with Go, however, so perhaps this is not the appropriate way to fix it. Setting env CXXFLAGS=-std=c++17 in the Dockerfile for the compilation step didn't fix it, though.

(edited: it was originally CPPFLAGS, of course CXXFLAGS is more appropriate, though both work)

@ecyht2
Author
ecyht2 commented May 11, 2025

This doesn't build for me with Docker (e.g. docker build .).

This seems to fix the docker build:

diff --git a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
index 8a61cb93..1c4d711f 100644
--- a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
+++ b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
@@ -2,4 +2,5 @@ package rpc
 
 // #cgo CPPFLAGS: -I${SRCDIR}/../../include
 // #cgo CPPFLAGS: -I${SRCDIR}/../
+// #cgo CXXFLAGS: -std=c++17
 import "C"

I think putting it in the file is better. It might have worked when compiling in non-Docker environments because newer C++ compilers enable C++17 by default, which is required for std::filesystem (see https://en.cppreference.com/w/cpp/filesystem).

@gkpln3
gkpln3 commented May 17, 2025

I've made some fixes and changes to make this work on macOS: https://github.com/gkpln3/ollama/tree/feat/rpc

@gkpln3
gkpln3 commented May 24, 2025

This version does not currently work (at least not on my MacBook, due to a race condition). I've continued the work on this feature on my fork of Ollama, and it is working well.
Should I open a new pull request? @aquarat
@ecyht2 Would you like to review these changes?

@aquarat
aquarat commented May 24, 2025

This version does not currently work (at least not on my MacBook, due to a race condition). I've continued the work on this feature on my fork of Ollama, and it is working well. Should I open a new pull request? @aquarat

Hey, I'm not a maintainer of this project, but I'd say go for it and open the PR. I'm keen to give your branch a go 😊

@igorschlum

I can help with testing. I have a bunch of Mac mini M1s to try that on.

@gkpln3
gkpln3 commented May 25, 2025

@igorschlum I'd love that. Can you please test this PR? #10844

@ecyht2
Author
ecyht2 commented May 30, 2025

Closing in favor of #10844.

@ecyht2 ecyht2 closed this May 30, 2025