A modern, full-stack chat application demonstrating how to integrate a React frontend with a Go backend and run local Large Language Models (LLMs) using Docker's Model Runner.
This project showcases a complete Generative AI interface that includes:
- React/TypeScript frontend with a responsive chat UI
- Go backend server for API handling
- Integration with Docker's Model Runner to run Llama 3.2 locally
- Comprehensive observability with metrics, logging, and tracing
- NEW: llama.cpp metrics integration directly in the UI
Key features:
- Interactive chat interface with message history
- Real-time streaming responses (tokens appear as they're generated)
- Light/dark mode support based on user preference
- Dockerized deployment for easy setup and portability
- Run AI models locally without cloud API dependencies
- Cross-origin resource sharing (CORS) enabled
- Integration testing using Testcontainers
- Metrics and performance monitoring
- Structured logging with zerolog
- Distributed tracing with OpenTelemetry
- Grafana dashboards for visualization
- Advanced llama.cpp performance metrics
The application consists of these main components:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   Frontend    │ ──▶ │    Backend    │ ──▶ │ Model Runner  │
│  (React/TS)   │     │     (Go)      │     │  (Llama 3.2)  │
└───────────────┘     └───────────────┘     └───────────────┘
      :3000                 :8080                :12434
                              │                     │
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│    Grafana    │ ◀── │  Prometheus   │     │    Jaeger     │
│  Dashboards   │     │    Metrics    │     │    Tracing    │
└───────────────┘     └───────────────┘     └───────────────┘
      :3001                 :9091                :16686
There are two ways to connect to Model Runner:
The first uses Docker's internal DNS resolution to connect to the Model Runner:
- Connection URL: http://model-runner.docker.internal/engines/llama.cpp/v1/
- Configuration is set in backend.env
The second uses host-side TCP support:
- Connection URL: host.docker.internal:12434
- Requires updating the environment configuration (a sample backend.env sketch follows below)
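As an illustration, a backend.env along these lines selects the internal-DNS option. This is only a minimal sketch (the logging and tracing values are illustrative); treat the backend.env shipped with the repository as authoritative. The variables themselves are described in the backend development section further down.

BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=ollama
LOG_LEVEL=info
LOG_PRETTY=true
TRACING_ENABLED=true
# Illustrative value: point this at your OpenTelemetry collector (e.g. Jaeger's OTLP port)
OTLP_ENDPOINT=jaeger:4317

For the host-side TCP option, BASE_URL would instead point at host.docker.internal:12434.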
Prerequisites:
- Docker and Docker Compose
- Git
- Go 1.19 or higher (for local development)
- Node.js and npm (for frontend development)
Before starting, pull the required model:
docker model pull ai/llama3.2:1B-Q8_0
- Clone this repository:
  git clone https://github.com/ajeetraina/genai-app-demo.git
  cd genai-app-demo
- Start the application using Docker Compose:
  docker compose up -d --build
- Access the frontend at http://localhost:3000
- Access the observability dashboards:
  - Grafana: http://localhost:3001 (admin/admin). Note: configure the Prometheus data source URL as http://prometheus:9090 (not localhost:9090) to see metrics on the Grafana dashboards.
  - Jaeger UI: http://localhost:16686
  - Prometheus: http://localhost:9091
The frontend is built with React, TypeScript, and Vite:
cd frontend
npm install
npm run dev
This will start the development server at http://localhost:3000.
The Go backend can be run directly:
go mod download
go run main.go
Make sure to set the required environment variables from backend.env:
- BASE_URL: URL for the model runner
- MODEL: Model identifier to use
- API_KEY: API key for authentication (defaults to "ollama")
- LOG_LEVEL: Logging level (debug, info, warn, error)
- LOG_PRETTY: Whether to output pretty-printed logs
- TRACING_ENABLED: Enable OpenTelemetry tracing
- OTLP_ENDPOINT: OpenTelemetry collector endpoint
- The frontend sends chat messages to the backend API
- The backend formats the messages and sends them to the Model Runner
- The LLM processes the input and generates a response
- The backend streams the tokens back to the frontend as they're generated (a minimal sketch of this step follows the list)
- The frontend displays the incoming tokens in real-time
- Observability components collect metrics, logs, and traces throughout the process
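To make the streaming step concrete, here is a minimal Go sketch of a relay handler. It assumes the Model Runner exposes an OpenAI-compatible chat/completions endpoint under BASE_URL (consistent with the connection URLs above) and simply forwards the streamed chunks to the browser; the real handler in main.go additionally wires in CORS, metrics, logging, and tracing.

// Minimal sketch: forward a chat request to the Model Runner's
// OpenAI-compatible endpoint and relay the streamed chunks to the client.
package main

import (
    "bytes"
    "encoding/json"
    "net/http"
    "os"
)

func chatHandler(w http.ResponseWriter, r *http.Request) {
    // Decode the chat messages sent by the frontend.
    var req struct {
        Messages []map[string]string `json:"messages"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Build an OpenAI-style streaming request for the Model Runner.
    body, _ := json.Marshal(map[string]any{
        "model":    os.Getenv("MODEL"),
        "messages": req.Messages,
        "stream":   true,
    })
    upstream, err := http.Post(os.Getenv("BASE_URL")+"chat/completions",
        "application/json", bytes.NewReader(body))
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer upstream.Body.Close()

    // Relay the server-sent-event chunks as they arrive so the UI can
    // render tokens in real time.
    w.Header().Set("Content-Type", "text/event-stream")
    flusher, _ := w.(http.Flusher)
    buf := make([]byte, 4096)
    for {
        n, readErr := upstream.Body.Read(buf)
        if n > 0 {
            w.Write(buf[:n])
            if flusher != nil {
                flusher.Flush()
            }
        }
        if readErr != nil { // io.EOF ends the stream
            return
        }
    }
}

func main() {
    http.HandleFunc("/chat", chatHandler)
    http.ListenAndServe(":8080", nil)
}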
Project structure:
├── compose.yaml           # Docker Compose configuration
├── backend.env            # Backend environment variables
├── main.go                # Go backend server
├── frontend/              # React frontend application
│   └── src/               # Source code
│       ├── components/    # React components
│       ├── App.tsx        # Main application component
│       └── ...
├── pkg/                   # Go packages
│   ├── logger/            # Structured logging
│   ├── metrics/           # Prometheus metrics
│   ├── middleware/        # HTTP middleware
│   ├── tracing/           # OpenTelemetry tracing
│   └── health/            # Health check endpoints
├── prometheus/            # Prometheus configuration
├── grafana/               # Grafana dashboards and configuration
├── observability/         # Observability documentation
└── ...
The application includes detailed llama.cpp metrics displayed directly in the UI:
- Tokens per Second: Real-time generation speed
- Context Window Size: Maximum tokens the model can process
- Prompt Evaluation Time: Time spent processing the input prompt
- Memory per Token: Memory usage efficiency
- Thread Utilization: Number of threads used for inference
- Batch Size: Inference batch size
These metrics help in understanding the performance characteristics of llama.cpp models and can be used to optimize configurations.
The project includes comprehensive observability features.

Metrics:
- Model performance (latency, time to first token)
- Token usage (input and output counts)
- Request rates and error rates
- Active request monitoring
- llama.cpp-specific performance metrics
Logging:
- Structured JSON logs with zerolog (a minimal usage sketch follows this list)
- Log levels (debug, info, warn, error, fatal)
- Request logging middleware
- Error tracking
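As a minimal illustration of the zerolog setup (the repository's pkg/logger package wraps this with its own configuration):

package main

import (
    "os"

    "github.com/rs/zerolog"
)

func main() {
    // Default to structured JSON on stdout; LOG_PRETTY=true switches to
    // human-friendly console output.
    logger := zerolog.New(os.Stdout).With().Timestamp().Logger()
    if os.Getenv("LOG_PRETTY") == "true" {
        logger = zerolog.New(zerolog.ConsoleWriter{Out: os.Stdout}).With().Timestamp().Logger()
    }

    // LOG_LEVEL controls verbosity (debug, info, warn, error, fatal).
    if lvl, err := zerolog.ParseLevel(os.Getenv("LOG_LEVEL")); err == nil {
        logger = logger.Level(lvl)
    }

    // Structured fields make the logs easy to query in aggregation tools.
    logger.Info().Str("component", "backend").Int("port", 8080).Msg("server starting")
}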
Tracing:
- Request flow tracing with OpenTelemetry (a minimal setup sketch follows this list)
- Integration with Jaeger for visualization
- Span context propagation
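Similarly, a minimal OpenTelemetry setup sketch, assuming an OTLP/gRPC collector endpoint in OTLP_ENDPOINT; the service name and exporter options here are illustrative, and pkg/tracing does the real wiring:

package main

import (
    "context"
    "log"
    "os"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Export spans over OTLP/gRPC to the configured endpoint (e.g. Jaeger).
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(os.Getenv("OTLP_ENDPOINT")),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)

    // Wrap one unit of work (e.g. a single chat completion) in a span.
    tracer := otel.Tracer("genai-app-demo")
    _, span := tracer.Start(ctx, "chat.completion")
    // ... call the Model Runner here ...
    span.End()
}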
For more information, see Observability Documentation.
The application has been enhanced with specific metrics for llama.cpp models:
- Backend Integration: The Go backend collects and exposes llama.cpp-specific metrics:
  - Context window size tracking
  - Memory per token measurement
  - Token generation speed calculations
  - Thread utilization monitoring
  - Prompt evaluation timing
  - Batch size tracking
- Frontend Dashboard: A dedicated metrics panel in the UI shows:
  - Real-time token generation speed
  - Memory efficiency
  - Thread utilization with recommendations
  - Context window size visualization
  - Expandable detailed metrics view
  - Integration with the model info panel
- Prometheus Integration: All llama.cpp metrics are exposed to Prometheus for long-term storage and analysis (a minimal registration sketch follows this list):
  - Custom histograms for timing metrics
  - Gauges for resource utilization
  - Counters for token throughput
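A sketch of how such metrics might be registered with the Prometheus Go client; the metric names below are illustrative, not necessarily those exported by pkg/metrics:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Histogram for timing metrics such as prompt evaluation.
    PromptEvalSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "llamacpp_prompt_eval_seconds",
        Help:    "Time spent evaluating the input prompt.",
        Buckets: prometheus.DefBuckets,
    })

    // Gauges for resource utilization.
    ContextWindowTokens = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "llamacpp_context_window_tokens",
        Help: "Maximum number of tokens the model can process.",
    })
    ThreadsInUse = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "llamacpp_threads_in_use",
        Help: "Number of threads used for inference.",
    })

    // Counter for token throughput.
    TokensGenerated = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "llamacpp_tokens_generated_total",
        Help: "Total number of tokens generated.",
    })
)

func init() {
    // Register everything with the default registry served at /metrics.
    prometheus.MustRegister(PromptEvalSeconds, ContextWindowTokens, ThreadsInUse, TokensGenerated)
}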
You can customize the application by:
- Changing the model in backend.env to use a different LLM
- Modifying the frontend components for a different UI experience
- Extending the backend API with additional functionality
- Customizing the Grafana dashboards for different metrics
- Adjusting llama.cpp parameters for performance optimization
The project includes integration tests using Testcontainers:
cd tests
go test -v
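A minimal sketch of the Testcontainers pattern these tests follow; the image tag, port, and endpoint below are assumptions for illustration rather than the repository's exact test code:

package tests

import (
    "context"
    "testing"

    "github.com/testcontainers/testcontainers-go"
    "github.com/testcontainers/testcontainers-go/wait"
)

func TestBackendStartsAndListens(t *testing.T) {
    ctx := context.Background()

    // Start the backend image and wait until it is listening on its API port.
    backend, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "genai-app-demo-backend:latest", // assumed image tag
            ExposedPorts: []string{"8080/tcp"},
            WaitingFor:   wait.ForListeningPort("8080/tcp"),
        },
        Started: true,
    })
    if err != nil {
        t.Fatal(err)
    }
    defer backend.Terminate(ctx)

    // Resolve the mapped host/port; real tests would hit endpoints such as /health.
    host, _ := backend.Host(ctx)
    port, _ := backend.MappedPort(ctx, "8080")
    t.Logf("backend reachable at http://%s:%s", host, port.Port())
}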
- Model not loading: Ensure you've pulled the model with docker model pull
- Connection errors: Verify Docker network settings and that Model Runner is running
- Streaming issues: Check CORS settings in the backend code
- Metrics not showing: Verify that Prometheus can reach the backend metrics endpoint
- llama.cpp metrics missing: Confirm that your model is indeed a llama.cpp model
License: MIT
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request