> [!WARNING]
> This is a work in progress.
These model guides are intended to help you get started quickly with llama-swap configuration snippets.
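For reference, a minimal llama-swap config is sketched below. The model name, GGUF path, and llama-server flags are placeholders to adapt to your own setup; `${PORT}` is the macro llama-swap substitutes with the port it assigns to the server it launches.

```yaml
# A minimal sketch, not a verbatim guide: the model name, GGUF path,
# and llama-server flags are placeholders for your own setup.
models:
  "qwen3-30b-a3b":
    cmd: |
      llama-server --port ${PORT}
      -m /models/Qwen3-30B-A3B-Q4_K_M.gguf
      -ngl 99 -c 32768
    # optional: unload the model after 5 minutes of inactivity
    ttl: 300
```

Requests to llama-swap's OpenAI-compatible endpoints (e.g. `v1/chat/completions`) with `"model": "qwen3-30b-a3b"` start this llama-server instance on demand, swapping out whatever model was previously loaded.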
| Company | Model | VRAM Requirement | Server | Notes | Link |
|---|---|---|---|---|---|
| BGE | reranker v2 m3 | 343MB | llama.cpp | `v1/rerank` API with llama-server | link |
| Google | Gemma 3 27B | 24GB to 27GB | llama.cpp | 100K context on single and dual 24GB GPUs | link |
| Meta | llama-3.3-70B | 55GB | llama.cpp | 13 to 20 tok/sec with 2x3090 and a P40 for speculative decoding | link |
| Meta | llama4-scout | 68.62GB | llama.cpp | Fully loading Scout with 62K context onto 3x24GB GPUs | link |
| Mistral | Small 3.1 | 24GB | llama.cpp | Text and vision support, 32K context | link |
| Nomic-AI | nomic-embed-text v1.5 | 280MB | llama.cpp | `v1/embeddings` with llama-server | link |
| OpenAI | whisper-large-v3-turbo | 1.4GB | whisper.cpp | `v1/audio/transcriptions` speech to text with whisper.cpp | link |
| Qwen | qwen3-30b-a3b | 24GB | llama.cpp | 113 tok/sec on a 3090 | link |
| Qwen | QwQ, Coder 32B | 24GB to 48GB | llama.cpp | Local copilot with Aider, QwQ, and Qwen2.5 Coder 32B | link |
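For the embedding and rerank entries above, the pattern is the same: run llama-server in the matching mode and let llama-swap route by model name. A hedged sketch, assuming llama-server's `--embedding` and `--reranking` flags and placeholder model paths and quantizations:

```yaml
# Sketch only: GGUF paths and quantizations are placeholders.
models:
  # serves llama-swap's v1/embeddings endpoint
  "nomic-embed-text-v1.5":
    cmd: |
      llama-server --port ${PORT}
      -m /models/nomic-embed-text-v1.5.Q8_0.gguf
      --embedding
  # serves llama-swap's v1/rerank endpoint
  "bge-reranker-v2-m3":
    cmd: |
      llama-server --port ${PORT}
      -m /models/bge-reranker-v2-m3.Q8_0.gguf
      --reranking
```

Clients then call `v1/embeddings` or `v1/rerank` through llama-swap with the matching `model` field, and the proxy loads the right server on demand.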