Paddler

Paddler is an open-source load balancer and reverse proxy designed to optimize servers running llama.cpp.

Typical load-balancing strategies like round robin or least connections are not effective for llama.cpp servers, which rely on slots for continuous batching and concurrent requests.

Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution. Additionally, Paddler uses agents to monitor the health of individual llama.cpp instances, providing feedback to the load balancer for optimal performance. Paddler also supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
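To make the idea of slot-aware balancing concrete, here is a minimal Python sketch. It assumes a "most idle slots first" policy; the `Instance` type and both functions are hypothetical and are not taken from Paddler's code, whose actual selection policy may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical view of one llama.cpp server as a balancer might track it."""
    name: str
    slots_idle: int
    slots_processing: int = 0


def pick_instance(instances: list[Instance]) -> Optional[Instance]:
    """Return the instance with the most idle slots, or None if every slot is busy."""
    candidates = [i for i in instances if i.slots_idle > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda i: i.slots_idle)


def assign(instances: list[Instance]) -> Optional[str]:
    """Reserve one slot on the chosen instance and return its name."""
    target = pick_instance(instances)
    if target is None:
        return None
    target.slots_idle -= 1
    target.slots_processing += 1
    return target.name


if __name__ == "__main__":
    pool = [Instance("a", slots_idle=1), Instance("b", slots_idle=3)]
    # The fifth request finds no idle slot and gets None.
    print([assign(pool) for _ in range(5)])
```

Unlike round robin, this never sends a request to a server whose slots are all busy while another server still has capacity.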

Note

In simple terms, slots in llama.cpp are predefined memory slices within the server that handle individual requests. When a request comes in, it is assigned to an available slot for processing. Slots are predictable and highly configurable.

You can learn more about them in the llama.cpp server documentation.

How it Works

Registering llama.cpp Instances

The sequence repeats for each agent. Install an agent alongside each llama.cpp instance so it can report that instance's health status to the load balancer.

sequenceDiagram
    participant loadbalancer as Paddler Load Balancer
    participant agent as Paddler Agent
    participant llamacpp as llama.cpp

    agent->>llamacpp: Hey, are you alive?
    llamacpp-->>agent: Yes, this is my health status
    agent-->>loadbalancer: llama.cpp is still working
    loadbalancer->>llamacpp: I have a request for you to handle
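One cycle of the sequence above can be sketched in Python as follows. This is an illustration only: `agent_tick`, the callables it takes, and the report fields (`external_addr`, `slots_idle`, `slots_processing`) are assumptions, not Paddler's actual wire format.

```python
from typing import Callable


def agent_tick(
    check_health: Callable[[], dict],
    report: Callable[[dict], None],
    external_addr: str,
) -> bool:
    """Run one cycle: probe llama.cpp, then forward the result to the balancer."""
    try:
        status = check_health()  # "Hey, are you alive?"
    except OSError:
        # Instance unreachable: nothing is reported in this cycle.
        return False
    # Attach the address the balancer should use to reach this instance.
    report({"external_addr": external_addr, **status})
    return True


if __name__ == "__main__":
    sent = []
    agent_tick(
        lambda: {"slots_idle": 4, "slots_processing": 0},  # stand-in for a health probe
        sent.append,  # stand-in for the call to the balancer's management endpoint
        "10.0.0.5:8088",
    )
    print(sent)
```

A real agent would call this in a loop on a fixed interval; the probe and report functions are injected here so the cycle can be exercised without a running server.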

Tutorials

Usage

Installation

You can download the latest release from the releases page.

Alternatively, you can build the project yourself. This requires Go 1.21 or newer and Node.js (for the dashboard's front-end code).

$ git clone git@github.com:distantmagic/paddler.git
$ cd paddler
$ pushd ./management
$ make esbuild # dashboard front-end
$ popd
$ go build -o paddler

Running Agents

The agent should be installed on the same host as llama.cpp.

It needs a few pieces of information:

external-* tells how the load balancer can connect to the llama.cpp instance
local-* tells how the agent can connect to the llama.cpp instance
management-* tells where the agent should report the health status

./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085

Replace hosts and ports with your own server addresses when deploying.

Running Load Balancer

The load balancer collects data from agents and exposes a reverse proxy to the outside world.

It requires two sets of flags:

management-* tells where the load balancer should listen for updates from agents
reverseproxy-* tells how the load balancer can be reached from outside hosts

./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080

management-host and management-port in agents should be the same as in the load balancer.

You can enable the dashboard to see the status of the agents with the --management-dashboard-enable=true flag. When enabled, it is available at the management server address under the /dashboard path.

Feature Highlight

Aggregated Health Status

Paddler overrides the /health endpoint of llama.cpp and reports the total number of available and processing slots across all registered llama.cpp instances.
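The aggregation itself amounts to summing per-instance counters. The Python sketch below assumes each instance reports `slots_idle` and `slots_processing` (names borrowed from the StatsD metrics); the exact response body Paddler returns is not shown here, so treat the dictionary shape as hypothetical.

```python
def aggregate_health(reports: list[dict]) -> dict:
    """Sum idle and processing slots over every reported instance."""
    return {
        "slots_idle": sum(r["slots_idle"] for r in reports),
        "slots_processing": sum(r["slots_processing"] for r in reports),
    }


if __name__ == "__main__":
    reports = [
        {"slots_idle": 3, "slots_processing": 1},
        {"slots_idle": 0, "slots_processing": 4},
    ]
    print(aggregate_health(reports))  # {'slots_idle': 3, 'slots_processing': 5}
```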

AWS Integration

Note

Available since v0.3.0

When running on AWS EC2, you can set --external-llamacpp-host to aws:metadata:local-ipv4. In that case, Paddler will use EC2 instance metadata to fetch the instance's local IP address (from the local network).

If you want to keep the balancer management address predictable, I recommend using Route 53 to create a record that always points to your load balancer (for example paddler_balancer.example.com). The agent command then looks like this:

./paddler agent \
    --external-llamacpp-host aws:metadata:local-ipv4 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host paddler_balancer.example.com \
    --management-port 8085
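For readers curious what resolving aws:metadata:local-ipv4 involves, the following Python sketch queries the EC2 instance metadata service using the IMDSv2 token flow. Whether Paddler uses IMDSv1 or IMDSv2 is not stated here, and `resolve_host` and `fetch_local_ipv4` are hypothetical names, so read this as an illustration of the mechanism rather than Paddler's implementation.

```python
import urllib.request
from typing import Callable

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def fetch_local_ipv4(timeout: float = 2.0) -> str:
    """Fetch the instance's private IPv4 address (only works on an EC2 instance)."""
    # IMDSv2: first obtain a short-lived session token.
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    # Then read the metadata key with that token.
    ip_req = urllib.request.Request(
        IMDS + "/meta-data/local-ipv4",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(ip_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def resolve_host(value: str, fetch: Callable[[], str] = fetch_local_ipv4) -> str:
    """Replace the special marker with the fetched address; pass other values through."""
    if value == "aws:metadata:local-ipv4":
        return fetch()
    return value
```

Calling `resolve_host("127.0.0.1")` returns the value unchanged, while `resolve_host("aws:metadata:local-ipv4")` triggers the metadata lookup. The `fetch` parameter exists so the lookup can be replaced off EC2.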

Buffered Requests (Scaling from Zero Hosts)

Note

Available since v0.3.0

The load balancer's buffered requests allow your infrastructure to scale from zero hosts by providing an additional metric (requests waiting to be handled).

Buffering also gives your infrastructure extra time to add hosts. For example, if your autoscaler is setting up a new server, holding an incoming request for 60 seconds might give it a chance to be handled even if no llama.cpp instances are available at the moment it is issued.

Scaling from zero hosts is especially suitable for low-traffic projects because it lets you cut infrastructure costs: you are not paying your cloud provider for llama.cpp hosts while the service is idle.
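The buffering behavior described above can be sketched in Python with a condition variable: a request waits up to a timeout for a slot to appear instead of failing immediately. `SlotPool` and its methods are hypothetical names, and the sketch says nothing about how Paddler implements its buffer internally.

```python
import threading


class SlotPool:
    """Tracks idle slots and lets requests wait for one, up to a timeout."""

    def __init__(self, idle: int = 0) -> None:
        self._idle = idle
        self._cond = threading.Condition()
        # Requests currently inside acquire(); this is the number an
        # autoscaler could watch as "requests waiting to be handled".
        self.buffered = 0

    def add_slots(self, n: int) -> None:
        """Called when a new llama.cpp instance registers or a slot frees up."""
        with self._cond:
            self._idle += n
            self._cond.notify_all()

    def acquire(self, timeout: float) -> bool:
        """Hold the request until a slot is idle; return False on timeout."""
        with self._cond:
            self.buffered += 1
            try:
                got_slot = self._cond.wait_for(lambda: self._idle > 0, timeout=timeout)
                if got_slot:
                    self._idle -= 1
                return got_slot
            finally:
                self.buffered -= 1


if __name__ == "__main__":
    pool = SlotPool()  # zero hosts at startup
    # Simulate an autoscaler bringing up a host shortly after the request arrives.
    threading.Timer(0.1, pool.add_slots, args=(1,)).start()
    print(pool.acquire(timeout=2.0))  # request is held, then served
    print(pool.acquire(timeout=0.1))  # no further slots: times out
```

With a 60-second timeout in place of the short values used here, a request issued while no instances exist can still succeed if a host comes up in time.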

State Dashboard

Although Paddler integrates with the StatsD protocol, you can also preview the cluster's state using the built-in dashboard.

StatsD Metrics

Note

Available since v0.3.0

Tip

If you keep your stack self-hosted, you can use Prometheus with the StatsD exporter to handle the incoming metrics.

Tip

This feature works with AWS CloudWatch Agent as well.

Paddler supports the following StatsD metrics:

paddler.requests_buffered: number of buffered requests since the last report (resets after each report)
paddler.slots_idle: total idle slots
paddler.slots_processing: total slots processing requests

All of them use the gauge metric type internally.
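For reference, a StatsD gauge is a single text line of the form `name:value|g`. The Python sketch below formats the three metrics that way; the function name and sample values are made up for illustration, and how Paddler batches or transports these lines is not covered.

```python
def gauge_lines(metrics: dict[str, int]) -> list[str]:
    """Format each metric as a StatsD gauge line: <name>:<value>|g."""
    return [f"{name}:{value}|g" for name, value in metrics.items()]


if __name__ == "__main__":
    sample = {
        "paddler.requests_buffered": 2,
        "paddler.slots_idle": 5,
        "paddler.slots_processing": 3,
    }
    for line in gauge_lines(sample):
        print(line)
```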

StatsD metrics need to be enabled with the following flags:

# Add these flags to the rest of your balancer flags:
./paddler balancer \
    --statsd-enable=true \
    --statsd-host=127.0.0.1 \
    --statsd-port=8125 \
    --statsd-scheme=http
