# Home · mostlygeek/llama-swap Wiki

> [!WARNING]
> This is a work in progress.

## About

These model guides provide llama-swap configuration snippets to help you get started quickly with specific models.
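
For orientation before diving into a guide: a llama-swap configuration is a YAML file whose `models` map ties a model name to the command that serves it. The snippet below is a minimal sketch, not taken from any particular guide; the model name and file paths are placeholders you should replace with your own.

```yaml
# config.yaml — minimal llama-swap sketch; model name and paths are placeholders
models:
  "qwen2.5-7b":
    # Command llama-swap launches the first time a request names this model.
    # ${PORT} is substituted by llama-swap with the upstream port it proxies to.
    cmd: >
      /path/to/llama-server
      --model /path/to/qwen2.5-7b-instruct-q4_k_m.gguf
      --port ${PORT}
    # Unload the server after 60 seconds of inactivity (optional).
    ttl: 60
```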

## Model Guides

| Company | Model | VRAM Requirement | Server | Notes | Link |
|---|---|---|---|---|---|
| BGE | reranker v2 m3 | 343 MB | llama.cpp | v1/rerank API with llama-server | link |
| Google | Gemma 3 27B | 24 to 27 GB | llama.cpp | 100K context on single and dual 24 GB GPUs | link |
| Meta | llama-3.3-70B | 55 GB | llama.cpp | 13 to 20 tok/s with 2x 3090 and a P40 for speculative decoding | link |
| Meta | llama4-scout | 68.62 GB | llama.cpp | Fully loading Scout with 62K context onto 3x 24 GB GPUs | link |
| Mistral | Small 3.1 | 24 GB | llama.cpp | Text and vision support, 32K context | link |
| Nomic-AI | nomic-embed-text v1.5 | 280 MB | llama.cpp | v1/embeddings with llama-server | link |
| OpenAI | whisper-large-v3-turbo | 1.4 GB | whisper.cpp | v1/audio/transcriptions speech-to-text with whisper.cpp | link |
| Qwen | qwen3-30b-a3b | 24 GB | llama.cpp | 113 tok/s on a 3090 | link |
| Qwen | QwQ, Coder 32B | 24 to 48 GB | llama.cpp | Local copilot with Aider, QwQ, and Qwen2.5 Coder 32B | link |
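
Each guide above maps onto one of llama-swap's OpenAI-compatible endpoints (v1/rerank, v1/embeddings, and so on); llama-swap decides which server to start from the `model` field of the request. The example below is a sketch: it assumes llama-swap is listening on its default port 8080 and that a model was configured under the name `nomic-embed-text-v1.5`. Both are assumptions; adjust them to your config.

```sh
# Sketch: request embeddings through llama-swap.
# Assumes llama-swap on localhost:8080 and a configured model named
# "nomic-embed-text-v1.5"; both names are placeholders.
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nomic-embed-text-v1.5",
        "input": "llama-swap swaps models on demand"
      }'
```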

## Use Case Guides
