llama-swappo

A fork of llama-swap with a minimal Ollama-compatible API grafted onto it, so you can use it with clients that only support Ollama.

This makes llama-swappo a drop-in replacement for ollama for enthusiasts who want more control, with broader compatibility.

This fork is automatically rebased onto the latest llama-swap nightly.

Features

  • ✅ Ollama API supported endpoints (example queries follow this list):
    • HEAD / - for health check
    • api/tags - to list models
    • api/show - for model details
    • api/ps - to show what's running
    • api/generate (untested, clients I've used so far seem to use the OpenAI compatible endpoints for actual generation and chat)
    • api/chat (untested)
    • api/embed
    • api/embeddings
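
As a quick sanity check of these endpoints, you can hit the read-only ones with curl. This is just a sketch: it assumes llama-swappo is listening on localhost:8080 (adjust to whatever address you pass to --listen).

# list the configured models, Ollama-style
curl http://localhost:8080/api/tags

# show which models are currently loaded
curl http://localhost:8080/api/ps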

How to install

git clone https://github.com/kooshi/llama-swappo
cd llama-swappo
go build

That builds the executable. Place it wherever your currently installed binary lives, start it as usual, and it should just work.

Configuration

If you're using llama-server, llama-swappo will try to parse your command arguments for the additional metadata it needs, such as context length. Alternatively, you can define the values in your config, which will override the inferred values.

model1:
  cmd: path/to/cmd --arg1 one
  proxy: "http://localhost:8080"

  # these override the inferred values
  metadata:
    architecture: qwen3
    contextLength: 131072
    capabilities:
    - completion # for chat models
    - tools # for tool use (requires --jinja in llama-server, and you must compile with this PR included https://github.com/ggml-org/llama.cpp/pull/12379)
    - insert # for fill-in-the-middle (FIM) coding, untested
    - vision # untested
    - embedding #untested
    family: qwen # probably not needed
    parameterSize: 32B # probably not needed
    quantizationLevel: Q4_K_M # probably not needed
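
As a quick way to confirm that the metadata was picked up (whether inferred from the llama-server arguments or set explicitly), you can query the Ollama-style api/show endpoint. A minimal sketch, assuming the proxy listens on localhost:8080 and using the model1 name from the example above:

# request details for model1; the response should reflect the metadata above
curl -s http://localhost:8080/api/show \
  -H "Content-Type: application/json" \
  -d '{"model":"model1"}'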

Support

This was a personal tweak so I could play with local models in GitHub Copilot without having to deal with ollama. I offered to merge this into the upstream repo, but the maintainer decided, and I agree, that this change overcomplicates the elegance of llama-swap and would be too much of a burden to maintain forever. My interests have already swung back to other projects, so I don't intend to support this seriously, and I won't be providing Docker images or anything else. I'll accept pull requests if you fix something, though.

Original README follows



llama-swap

llama-swap is a lightweight, transparent proxy server that provides automatic model swapping for llama.cpp's server.

Written in golang, it is very easy to install (single binary with no dependencies) and configure (single yaml file). To get started, download a pre-built binary or use the provided docker images.

Features:

  • ✅ Easy to deploy: single binary with no dependencies
  • ✅ Easy to config: single yaml file
  • ✅ On-demand model switching
  • ✅ OpenAI API supported endpoints:
    • v1/completions
    • v1/chat/completions
    • v1/embeddings
    • v1/rerank
    • v1/audio/speech (#36)
    • v1/audio/transcriptions (docs)
  • ✅ llama-swap custom API endpoints (example queries follow the feature list)
    • /ui - web UI
    • /log - remote log monitoring
    • /upstream/:model_id - direct access to upstream HTTP server (demo)
    • /unload - manually unload running models (#58)
    • /running - list currently running models (#61)
  • ✅ Run multiple models at once with Groups (#107)
  • ✅ Automatic unloading of models after timeout by setting a ttl
  • ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
  • ✅ Docker and Podman support
  • ✅ Full control over server settings per model
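
As a sketch of the custom endpoints above (assuming llama-swap is listening on localhost:8080; see the linked issues and demo for the exact request and response shapes):

# list currently running models
curl http://localhost:8080/running

# manually unload the running models
curl http://localhost:8080/unload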

How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap will extract the model value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in: the upstream server is automatically swapped to the correct one to serve the request.

In the most basic configuration, llama-swap handles one model at a time. For more advanced use cases, the groups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.

config.yaml

llama-swap is managed entirely through a yaml configuration file.

It can be very minimal to start:

models:
  "qwen2.5":
    cmd: |
      /path/to/llama-server
      -hf bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M
      --port ${PORT}

However, there are many more capabilities that llama-swap supports:

  • groups to run multiple models at once
  • ttl to automatically unload models
  • macros for reusable snippets
  • aliases to use familiar model names (e.g., "gpt-4o-mini")
  • env to pass custom environment variables to inference servers
  • cmdStop to gracefully stop Docker/Podman containers
  • useModelName to override model names sent to upstream servers
  • healthCheckTimeout to control model startup wait times
  • ${PORT} automatic port variables for dynamic port assignment

See the configuration documentation in the wiki for all options and examples.
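
As a small, illustrative sketch of a couple of these options (ttl and aliases; the key names follow the list above, but treat the wiki as authoritative), written as a shell heredoc so it can be pasted directly:

# writes a fresh config.yaml (adjust if you already have one)
cat > config.yaml <<'EOF'
models:
  "qwen2.5":
    cmd: |
      /path/to/llama-server
      -hf bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M
      --port ${PORT}
    ttl: 300            # unload automatically after 5 minutes of inactivity
    aliases:
      - "gpt-4o-mini"   # clients can keep using a familiar model name
EOF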

Web UI

llama-swap ships with a web-based interface to make it easier to monitor logs and check the status of models.

[Screenshot: llama-swap web UI]

Docker Install (download images)

Docker is the quickest way to try out llama-swap:

# uses CPU inference; comes with the example config above
$ docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu

# qwen2.5 0.5B
$ curl -s http://localhost:9292/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{"model":"qwen2.5","messages": [{"role": "user","content": "tell me a joke"}]}' | \
    jq -r '.choices[0].message.content'

# SmolLM2 135M
$ curl -s http://localhost:9292/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{"model":"smollm2","messages": [{"role": "user","content": "tell me a joke"}]}' | \
    jq -r '.choices[0].message.content'

Docker images are built nightly for cuda, intel, vulkan, etc.

They include:

  • ghcr.io/mostlygeek/llama-swap:cpu
  • ghcr.io/mostlygeek/llama-swap:cuda
  • ghcr.io/mostlygeek/llama-swap:intel
  • ghcr.io/mostlygeek/llama-swap:vulkan
  • ROCm disabled until fixed in llama.cpp container

Specific versions are also available and are tagged with the llama-swap, architecture and llama.cpp versions. For example: ghcr.io/mostlygeek/llama-swap:v89-cuda-b4716

Beyond the demo you will likely want to run the containers with your downloaded models and custom configuration.

$ docker run -it --rm --runtime nvidia -p 9292:8080 \
  -v /path/to/models:/models \
  -v /path/to/custom/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda

Bare metal Install (download)

Pre-built binaries are available for Linux, Mac, Windows and FreeBSD. These are automatically published and are likely a few hours ahead of the Docker releases. The bare metal install works with any OpenAI compatible server, not just llama-server.

  1. Download a release appropriate for your OS and architecture.
  2. Create a configuration file, see the configuration documentation.
  3. Run the binary with llama-swap --config path/to/config.yaml --listen localhost:8080. Available flags (a combined example follows this list):
    • --config: Path to the configuration file (default: config.yaml).
    • --listen: Address and port to listen on (default: :8080).
    • --version: Show version information and exit.
    • --watch-config: Automatically reload the configuration file when it changes. This will wait for in-flight requests to complete then stop all running models (default: false).
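
For example, a typical invocation combining these flags might look like the following (the config path and listen address are placeholders):

# reload the config on change, listening on all interfaces at port 9292
llama-swap --config /path/to/config.yaml --listen 0.0.0.0:9292 --watch-config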

Building from source

  1. Building requires Go and Node.js (for the user interface).
  2. git clone git@github.com:mostlygeek/llama-swap.git
  3. make clean all
  4. Binaries will be in the build/ subdirectory.

Monitoring Logs

Open http://<host>:<port>/ in your browser to get a web interface with streaming logs.

CLI access is also supported:

# sends up to the last 10KB of logs
curl 'http://host/logs'

# streams combined logs
curl -Ns 'http://host/logs/stream'

# just llama-swap's logs
curl -Ns 'http://host/logs/stream/proxy'

# just upstream's logs
curl -Ns 'http://host/logs/stream/upstream'

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'

Do I need to use llama.cpp's server (llama-server)?

Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.

For Python-based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to SIGTERM signals on shutdown.
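
As a hedged sketch of that approach (the image, model and flags below are illustrative, and the exact config keys are documented in the wiki), a containerized entry can pair cmd with cmdStop so llama-swap can stop the container gracefully:

# writes an illustrative snippet; merge it under models: in your config.yaml
cat > vllm-example.yaml <<'EOF'
models:
  "qwen2.5-vllm":
    cmd: |
      docker run --name qwen2.5-vllm --rm --gpus all -p ${PORT}:8000
      vllm/vllm-openai:latest --model Qwen/Qwen2.5-0.5B-Instruct
    cmdStop: docker stop qwen2.5-vllm   # docker stop sends SIGTERM first
EOF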

