discover/gpu.go: Add Support for Distributed Inferencing by ecyht2 · Pull Request #6729 · ollama/ollama · GitHub

discover/gpu.go: Add Support for Distributed Inferencing #6729


Closed
wants to merge 16 commits

Conversation

@ecyht2 ecyht2 commented Sep 10, 2024

This feature adds support for llama.cpp RPC. This allows for distributed inferencing on different devices.

This Pull Request aims to implement #4643.
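
As a rough sketch of the general idea (the helper below is illustrative rather than the PR's exact code), the discovery layer can parse a comma-separated OLLAMA_RPC_SERVERS environment variable into a list of host:port endpoints that the scheduler can offload layers to:

package discover

import (
	"os"
	"strings"
)

// rpcServersFromEnv is a hypothetical helper sketching the general approach:
// parse OLLAMA_RPC_SERVERS (e.g. "host1:50052,host2:50052") into a slice of
// endpoints that the scheduler could treat as extra backends to offload to.
func rpcServersFromEnv() []string {
	raw := os.Getenv("OLLAMA_RPC_SERVERS")
	if raw == "" {
		return nil
	}
	var servers []string
	for _, s := range strings.Split(raw, ",") {
		if s = strings.TrimSpace(s); s != "" {
			servers = append(servers, s)
		}
	}
	return servers
}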

@ecyht2 ecyht2 marked this pull request as draft September 13, 2024 01:06
@ecyht2 ecyht2 marked this pull request as ready for review September 17, 2024 06:33
@EvilFreelancer
EvilFreelancer commented Sep 17, 2024

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.
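
A minimal sketch of the precedence being proposed here, using a hypothetical resolveRPCServers helper (the per-request list wins, then the environment variable, then no RPC backends):

package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveRPCServers illustrates the suggested fallback order; it is not an
// existing ollama function. requestRPC would come from the HTTP payload.
func resolveRPCServers(requestRPC string) []string {
	if requestRPC != "" {
		return strings.Split(requestRPC, ",") // per-request list takes priority
	}
	if env := os.Getenv("OLLAMA_RPC_SERVERS"); env != "" {
		return strings.Split(env, ",") // otherwise the server-wide default
	}
	return nil // otherwise no RPC backends: local inference only
}

func main() {
	fmt.Println(resolveRPCServers(""))                                // uses OLLAMA_RPC_SERVERS, if set
	fmt.Println(resolveRPCServers("backend01:50052,backend02:50052")) // ignores the env var
}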

@hingstarne

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.

As far as I remember, the RPC feature of llama.cpp works only with a static set, so you need to pass the RPC endpoints to control beforehand. That would work only with StatefulSets on Kubernetes, as they have predictable names. Another way would be running a controller container that fetches all endpoints from the ollama service and creates the list for you. But be aware that if you don't use StatefulSets, the IP will very likely change on every restart of the pod.
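
To make the StatefulSet point concrete, here is a hedged sketch of generating the static endpoint list from the predictable pod DNS names a headless service provides; all names and the port are hypothetical:

package main

import "fmt"

// rpcEndpoints sketches how a static RPC endpoint list can be generated from
// predictable StatefulSet pod DNS names
// (<set>-<ordinal>.<service>.<namespace>.svc.cluster.local).
// The set/service/namespace names and the port below are hypothetical.
func rpcEndpoints(set, svc, namespace string, replicas, port int) []string {
	endpoints := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		endpoints = append(endpoints, fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local:%d", set, i, svc, namespace, port))
	}
	return endpoints
}

func main() {
	// e.g. llama-rpc-0.llama-rpc.default.svc.cluster.local:50052, ...
	fmt.Println(rpcEndpoints("llama-rpc", "llama-rpc", "default", 3, 50052))
}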

@kerthcet
kerthcet commented Sep 18, 2024

Maybe not related, but we have an example for llama.cpp distributed serving in Kubernetes with LWS; take it away if you're interested: https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/llamacpp/README.md

@ecyht2
Author
ecyht2 commented Sep 18, 2024

Hello! Perhaps it would be more flexible to allow users to pass a list of RPC servers through an HTTP payload instead of using the OLLAMA_RPC_SERVERS environment variable?

Alternatively, we could provide flexibility by using the value from OLLAMA_RPC_SERVERS if no list is provided by the user and replacing it with the user-provided list when available (and a third case where OLLAMA_RPC_SERVERS is not set but the user provides a list of backends).

This approach would be useful, for example, when a user has multiple RPC clusters (e.g., a couple of namespaces in Kubernetes) and needs to switch between them depending on the use case.

As explained in @hingstarne's comment, llama.cpp only works with a static set of RPC servers. In theory this might work, because the llama.cpp backend server only starts when a request is sent (as far as I know), but it is kept alive for a certain amount of time. However, the OLLAMA_RPC_SERVERS option will not take effect for future requests if the current backend server is still up.

EDIT: Better explanation.

@EvilFreelancer
EvilFreelancer commented Sep 18, 2024

@ecyht2 hi!

As far as I remember, the RPC feature of llama.cpp works only with a static set, so you need to pass the RPC endpoints to control beforehand.

You are correct. However, in ollama, the llama-server is started dynamically with each user request for processing the required model (if not already started). ollama also has a keep_alive timeout for unloading inactive instances from memory. So practically, users already have control over llama-server instances.

My proposal is to allow users to pass an rpc option in the request payload (similar to model or keep_alive), enabling the ability to launch different instances of llama-server, each connected to separate sets of RPC backends.

That would work only with StatefulSets on Kubernetes, as they have predictable names.

Kubernetes was just an example; this approach can be applied in various setups, such as multiple VMs, an OpenStack deployment, or a bunch of Raspberry Pi CM3 boards pre-configured using tools like Ansible.


Let me provide a small example to illustrate the concept.

Suppose we have an ollama instance pre-configured to work with a limited set of RPC servers. Currently, we can make a request like this:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

In this scenario, the server will communicate with the RPC backends defined in OLLAMA_RPC_SERVERS. However, if a client wants to query a different set of RPC servers, the current approach would require modifying the ollama settings, restarting the service, and then making the query again.

Now imagine if the client could pass an rpc attribute directly in the request:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend01:50052,backend01:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

With this approach, the user gains the flexibility to specify different RPC servers dynamically, without the need to restart the ollama instance.
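
In code terms, the proposal roughly amounts to one extra field on the generate request, sketched below as a simplified, hypothetical struct (the rpc field does not exist in ollama's current API, and the other field types are simplified for illustration):

package api

// GenerateRequest is a simplified, hypothetical shape used only to illustrate
// the proposal; the rpc field does not exist in ollama today, and keep_alive
// and options are shown with simplified types.
type GenerateRequest struct {
	Model     string         `json:"model"`
	Prompt    string         `json:"prompt"`
	KeepAlive int            `json:"keep_alive,omitempty"`
	Options   map[string]any `json:"options,omitempty"`
	RPC       string         `json:"rpc,omitempty"` // comma-separated host:port list of RPC backends
}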

@EvilFreelancer

As explained in @hingstarne's comment, llama.cpp only works with a static set of RPC servers.

For llama.cpp, you're correct — it works with a static set of RPC servers.
However, with ollama, it's different. It can dynamically start multiple instances of the llama.cpp server as needed.

@ecyht2
Author
ecyht2 commented Sep 19, 2024

dynamically with each user request for processing the required model (if not already started).

The reason why I don't go with your suggestion is the "required model (if not already started)" part, with emphasis on "(if not already started)".

Let's say a request like this is sent:

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend01:50052,backend02:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Then another request is sent within the timeout period.

curl http://gpu02:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "rpc": "backend03:50052,backend04:50052",
  "keep_alive": 600,
  "options": { "temperature": 0 }
}'

Because the model is already started, the second request would use backend01:50052,backend02:50052 as the backend instead of the expected backend03:50052,backend04:50052. This is because llama-server was started with backend01:50052,backend02:50052 and the model parameters are already loaded. As far as I know, the model scheduler only checks the model name to see whether a model is loaded or not.

This is the main reason why I chose the option of setting the RPC servers when doing ollama serve. This is my understanding of what would happen. That said, you can select a GPU from the request, but I believe it might cause the same unexpected behavior.
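
A simplified sketch of the failure mode described above (not ollama's actual scheduler code): if runner reuse is keyed only on the model name, the second request silently inherits whatever RPC backends the first runner was started with.

package main

import "fmt"

// runner is a stand-in for a loaded llama-server instance; its RPC backends
// are fixed at start time and cannot change afterwards.
type runner struct {
	model      string
	rpcServers string
}

// pickRunner sketches reuse keyed only on the model name: if the model is
// already loaded, the newly requested RPC list is ignored.
func pickRunner(loaded map[string]*runner, model, rpc string) *runner {
	if r, ok := loaded[model]; ok {
		return r // reused as-is, even if r.rpcServers != rpc
	}
	r := &runner{model: model, rpcServers: rpc}
	loaded[model] = r
	return r
}

func main() {
	loaded := map[string]*runner{}
	pickRunner(loaded, "llama3.1:8b", "backend01:50052,backend02:50052")
	r := pickRunner(loaded, "llama3.1:8b", "backend03:50052,backend04:50052")
	fmt.Println(r.rpcServers) // still backend01:50052,backend02:50052
}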

@EvilFreelancer

Okay, agree.

@rick-github
Collaborator

ollama runs needsReload before each request. It includes a check for changes in parameters to the llama server, so if the rpc backends change, it should cause a model reload.

@ecyht2
Author
ecyht2 commented Sep 25, 2024

ollama runs needsReload before each request. It includes a check for changes in parameters to the llama server, so if the rpc backends change, it should cause a model reload.

Ah OK, I will add it when I am free.

@hidden1nin
Contributor
hidden1nin commented Sep 26, 2024

Would this just be checking if the options change inside this method?
func (runner *runnerRef) needsReload(ctx context.Context, req *LlmRequest) bool {
Doesn't !reflect.DeepEqual(optsExisting, optsNew) || // have the runner options changed? cover this already?

Could you just grab the result of func getAvailableServers() map[string]string { when the runnerRef is made and store it, then compare it to a second call to getAvailableServers() during needsReload?
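
A sketch of the check being discussed, not the actual needsReload implementation: store the RPC server list the runner was started with on the runnerRef and compare it against the newly requested list, mirroring the existing reflect.DeepEqual comparison of options.

package server

import "reflect"

// rpcServersChanged is a hypothetical helper: startedWith is the list stored
// on the runnerRef when its llama-server was launched, requested is the list
// from the incoming request. Any difference should force a model reload.
func rpcServersChanged(startedWith, requested []string) bool {
	return !reflect.DeepEqual(startedWith, requested)
}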

@hidden1nin
Contributor

Following up, are we still planning on implementing this functionality?

@ecyht2
Author
ecyht2 commented Oct 4, 2024

Following up, are we still planning on implementing this functionality?

Yup, sorry, been a bit busy and unable to write code recently.

@dhiltgen
Collaborator
dhiltgen commented Oct 7, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) to begin replacing that code with a new Go based equivalent. The goal of this is to add more unit testing and fix some long standing stability bugs while preserving vision model support. If you need help rebasing your PR don't hesitate to contact us by replying here.

@ecyht2 ecyht2 marked this pull request as ready for review October 13, 2024 00:36
@ecyht2
Author
ecyht2 commented Oct 13, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) to begin replacing that code with a new Go based equivalent. The goal of this is to add more unit testing and fix some long standing stability bugs while preserving vision model support. If you need help rebasing your PR don't hesitate to contact us by replying here.

Thanks for the notice. I am not sure what needs to be done. This PR requires the RPC backend from GGML; how should I go about adding it?

From my understanding, something like this should be added:

go build -tags=rpc .
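
For reference, a Go build tag would keep the RPC backend opt-in. A minimal sketch with a hypothetical file and constant, compiled only when building with go build -tags=rpc .:

//go:build rpc

// This is a hypothetical rpc_enabled.go: it is only compiled when ollama is
// built with `go build -tags=rpc .`. A counterpart file guarded by
// //go:build !rpc would define rpcEnabled = false for default builds.
package discover

const rpcEnabled = true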

@ecyht2 ecyht2 marked this pull request as draft October 13, 2024 02:37
@dsluo
dsluo commented Mar 10, 2025

@aquarat I was able to get your aquarat#3 branch built on Mac, Windows, and Linux, and was able to connect to the Mac and Linux llama.cpp RPC servers using ggml-org/llama.cpp@89b2b56 from Windows, but actually trying to run deepseek-r1:70b didn't seem to offload anything to the RPC servers. It seemed to prefer to swap from disk rather than make RPC calls.

Also I'm not sure what happened, but your comment on this PR seems to have disappeared.

@aquarat
aquarat commented Mar 10, 2025

@aquarat I was able to get your aquarat#3 branch built on Mac, Windows, and Linux, and was able to connect to the Mac and Linux llama.cpp RPC servers using ggml-org/llama.cpp@89b2b56 from Windows, but actually trying to run deepseek-r1:70b didn't seem to offload anything to the RPC servers. It seemed to prefer to swap from disk rather than make RPC calls.

Also I'm not sure what happened, but your comment on this PR seems to have disappeared.

Hey
Wow, cool! I must admit I haven't had a chance to actually test anything yet, job and all. The merge conflict was caused by two large changes in this repo since October 2024.

The first was related to the way llama.cpp was built.

The other change was related to the "gpu" package being renamed to "discover" (05cd82e)

One of the things that occurred to me is that you need a separate RPC server executable that must be built from source; it would be nice to incorporate it into ollama so it's all one image/executable.

And I deleted my comment here after I realised the merge conflict was more complex than I originally expected 😆

Have you tried any other models? It seems like Ollama is happy to spread some models across GPUs but not all?

@igorschlum

Hi @aquarat, I have a bunch of 16 GB M1 Macs that I would like to stack to be able to run large models, and then compare the speed with a Mac Studio with 192 GB of RAM. If you guide me, I can try it and share results with the community.

@ecyht2 ecyht2 marked this pull request as ready for review May 4, 2025 07:42
@ecyht2
Author
ecyht2 commented May 4, 2025

Sorry for the wait, it should work now.

@ecyht2 ecyht2 changed the title Feature: Add Support for Distributed Inferencing discover/gpu.go: Add Support for Distributed Inferencing May 4, 2025
@ecyht2
Author
ecyht2 commented May 6, 2025

@dhiltgen @jmorganca This PR is ready for review.

@aquarat aquarat left a comment


Very exciting :)
I just figured I'd add some comments on language/docs to help improve the likelihood of this getting approved.

@aquarat
aquarat commented May 9, 2025

This doesn't build for me with Docker (e.g. docker build .). It does build directly, though. If I run the directly built executable, regardless of whether I use RPC servers or not, I get:

time=2025-05-09T15:49:08.418Z level=ERROR source=server.go:466 msg="llama runner terminated" error="exit status 2"
time=2025-05-09T15:49:08.460Z level=ERROR source=sched.go:476 msg="error loading llama server" error="llama runner process has terminated: error loading model: failed to find a compatible buffer type for tensor blk.0.attn_norm.weight\nllama_model_load_from_file_impl: failed to load model"

I tried several models.

 > [build 6/6] RUN --mount=type=cache,target=/root/.cache/go-build     go build -trimpath -buildmode=pie -o /bin/ollama .:
2.264 # github.com/ollama/ollama/ml/backend/ggml/ggml/src/ggml-rpc
2.264 ggml-rpc.cpp:33:21: error: 'filesystem' is not a namespace-name
2.264    33 | namespace fs = std::filesystem;
2.264       |                     ^~~~~~~~~~
(...)
Dockerfile:97
--------------------
  96 |     ENV CGO_ENABLED=1
  97 | >>> RUN --mount=type=cache,target=/root/.cache/go-build \
  98 | >>>     go build -trimpath -buildmode=pie -o /bin/ollama .
  99 |

@eras
eras commented May 10, 2025

This doesn't build for me with Docker (e.g. docker build .).

This seems to fix the docker build:

diff --git a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
index 8a61cb93..1c4d711f 100644
--- a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
+++ b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
@@ -2,4 +2,5 @@ package rpc
 
 // #cgo CPPFLAGS: -I${SRCDIR}/../../include
 // #cgo CPPFLAGS: -I${SRCDIR}/../
+// #cgo CXXFLAGS: -std=c++17
 import "C"

I'm not very familiar with Go, however, so perhaps this is not the appropriate way to fix it. Setting env CXXFLAGS=-std=c++17 in the Dockerfile for the compilation step didn't fix it, though.

(edited: it was originally CPPFLAGS, of course CXXFLAGS is more appropriate, though both work)

@ecyht2
Author
ecyht2 commented May 11, 2025

This doesn't build for me with Docker (e.g. docker build .).

This seems to fix the docker build:

diff --git a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
index 8a61cb93..1c4d711f 100644
--- a/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
+++ b/ml/backend/ggml/ggml/src/ggml-rpc/rpc.go
@@ -2,4 +2,5 @@ package rpc
 
 // #cgo CPPFLAGS: -I${SRCDIR}/../../include
 // #cgo CPPFLAGS: -I${SRCDIR}/../
+// #cgo CXXFLAGS: -std=c++17
 import "C"

I think putting it in the file is better. It might have worked when compiling in non-Docker environments because newer C++ compilers enable C++17 by default, which is required for std::filesystem (see https://en.cppreference.com/w/cpp/filesystem).

@gkpln3
gkpln3 commented May 17, 2025

I've made some fixes and changes to make this work on macOS: https://github.com/gkpln3/ollama/tree/feat/rpc

@gkpln3
gkpln3 commented May 24, 2025

This version does not currently work (at least not on my MacBook, due to a race condition). I've continued the work on this feature on my fork of Ollama, and it is working well.
Should I open a new pull request? @aquarat
@ecyht2 Would you like to review these changes?

@aquarat
aquarat commented May 24, 2025

This version does not currently work (at least not on my MacBook, due to a race condition). I've continued the work on this feature on my fork of Ollama, and it is working well. Should I open a new pull request? @aquarat

Hey, I'm not a maintainer of this project, but I'd say go for it and open the PR. I'm keen to give your branch a go 😊

@igorschlum

I can help with testing. I have a bunch of Mac mini M1s to try that on.

@gkpln3
gkpln3 commented May 25, 2025

@igorschlum I'd love that. Can you please test this PR? #10844

@ecyht2
Author
ecyht2 commented May 30, 2025

Closing in favor of #10844.

@ecyht2 ecyht2 closed this May 30, 2025