This project provides a user-friendly interface for running LLMs with llama-cpp-python, with API key authentication and usage tracking. It offers:
- Easy Model Setup: Run models with your choice of backend (CPU, CUDA, Metal, OpenBLAS)
- API Key Authentication: Secure your API with user-specific API keys
- Usage Tracking: Track API usage statistics per user
- System Monitoring: Monitor CPU, memory, and GPU usage
- User Management: Add and delete API keys through the UI
The system consists of the following components:
- Electron App: The main user interface for managing models, users, and viewing system information
- Setup Script: Installs llama-cpp-python with the appropriate backend and downloads models
- llama-cpp-python Server: Runs the model and provides an OpenAI-compatible API
- FastAPI Middleware: Handles API key authentication and usage tracking
When you click "Run Model" in the UI, the system (see the sketch after this list):
- Runs the setup.sh script to install llama-cpp-python with the selected backend
- Downloads the model if a Hugging Face model ID was provided
- Starts the llama-cpp-python server on port 8000
- Starts the FastAPI middleware on port 8080
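For reference, here is a rough Python sketch of that sequence run outside the UI. The setup.sh flags, model path, and middleware module name are illustrative assumptions, not the project's exact invocation:

```python
import subprocess

# Install llama-cpp-python for the chosen backend and fetch the model.
# (setup.sh's flags are assumed here for illustration.)
subprocess.run(
    ["bash", "setup.sh", "--backend", "cuda",
     "--model", "TheBloke/Llama-2-7B-GGUF"],
    check=True,
)

# Start the llama-cpp-python OpenAI-compatible server on port 8000.
llama_server = subprocess.Popen(
    ["python", "-m", "llama_cpp.server",
     "--model", "models/llama-2-7b.gguf", "--port", "8000"]
)

# Start the FastAPI middleware on port 8080 ("middleware:app" is a
# hypothetical module path).
middleware = subprocess.Popen(["uvicorn", "middleware:app", "--port", "8080"])
```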
The FastAPI middleware (sketched below):
- Authenticates API requests using API keys
- Tracks usage statistics per user
- Forwards requests to the llama-cpp-python server
- Returns responses to the client
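A minimal sketch of how such a middleware can be structured. The key store, field names, and in-memory storage here are assumptions for illustration, not the project's actual implementation:

```python
import time

import httpx
from fastapi import FastAPI, Header, HTTPException, Request
from fastapi.responses import Response

LLAMA_SERVER = "http://localhost:8000"

app = FastAPI()

# In-memory stores for illustration; the real app persists keys and stats.
API_KEYS = {"example-key": "alice"}   # api key -> username
USAGE = {}                            # username -> statistics

@app.api_route("/v1/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request, api_key: str = Header(None)):
    # 1. Authenticate: FastAPI reads the "api-key" header into api_key.
    user = API_KEYS.get(api_key)
    if user is None:
        raise HTTPException(status_code=401, detail="Invalid API key")

    # 2. Forward the request to the llama-cpp-python server.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.request(
            request.method,
            f"{LLAMA_SERVER}/v1/{path}",
            content=await request.body(),
            headers={"content-type": request.headers.get("content-type", "application/json")},
        )

    # 3. Track usage (token counts could be parsed from the upstream
    #    JSON "usage" field; omitted here for brevity).
    stats = USAGE.setdefault(user, {"requests": 0, "endpoints": {}})
    stats["requests"] += 1
    stats["endpoints"][path] = stats["endpoints"].get(path, 0) + 1
    stats["last_request"] = time.time()

    # 4. Return the upstream response to the client.
    return Response(content=upstream.content,
                    media_type=upstream.headers.get("content-type"))
```

Note that this sketch buffers the full upstream response, so streaming responses are not handled; it shows the authenticate-forward-track-return flow, not production behavior.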
To use the API (a complete example follows this list):
- Send requests to http://localhost:8080/v1/... (same endpoints as OpenAI API)
- Include your API key in the "api-key" header
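For example, a chat completion request from Python (replace your-api-key with a key created through the UI):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"api-key": "your-api-key"},  # key issued through the UI
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```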
The middleware provides the following endpoints:
- All OpenAI-compatible endpoints from llama-cpp-python (forwarded)
- /admin/usage: Get usage statistics for all users (example below)
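For example (whether /admin/usage itself requires authentication depends on your configuration; add an api-key header if it does):

```python
import requests

stats = requests.get("http://localhost:8080/admin/usage").json()
print(stats)
```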
The system tracks the following usage statistics per user (an illustrative record follows the list):
- Total number of requests
- Total number of tokens used
- Last request timestamp
- Endpoint usage counts
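Conceptually, each user's record looks something like the following. The field names here are illustrative, not a guaranteed schema:

```python
usage = {
    "alice": {
        "total_requests": 42,                     # total number of requests
        "total_tokens": 18731,                    # total number of tokens used
        "last_request": "2024-05-01T12:34:56Z",   # last request timestamp
        "endpoints": {                            # endpoint usage counts
            "/v1/chat/completions": 40,
            "/v1/embeddings": 2,
        },
    },
}
```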
Running the system requires:
- Python 3.8+
- Node.js 14+
- Electron
- For CUDA backend: NVIDIA GPU with CUDA toolkit
- For Metal backend: Apple Silicon or AMD GPU on macOS
This project is licensed under the MIT License.