Lingo is a lightweight, scale-from-zero ML model proxy that runs on Kubernetes. Lingo allows you to run text-completion and embedding servers in your own project without changing any of your OpenAI client code.
🚀 Serve OSS LLMs on CPUs or GPUs
✅️ Compatible with the OpenAI API
⚖️ Scale from zero, autoscale based on load
… Queue requests to avoid overloading models
🛠️ Zero dependencies (no Istio, Knative, etc.)
⦿ Namespaced - no cluster privileges needed
Support the project by adding a star! ⭐️
This quickstart walks through installing Lingo and demonstrates how it scales models from zero. It should work on any Kubernetes cluster (GKE, EKS, AKS, Kind).
Start by adding and updating the Substratus Helm repo.
helm repo add substratusai https://substratusai.github.io/helm
helm repo update
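To confirm the repo was added, you can list the Substratus charts; the chart names used in the steps below should appear in the output.
helm search repo substratusai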
Install Lingo.
helm install lingo substratusai/lingo
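Lingo is namespaced and needs no cluster privileges, so you can also scope the install to a dedicated namespace if you prefer. The namespace name below is only an example; if you use it, pass the same namespace to the kubectl and helm commands that follow.
helm install lingo substratusai/lingo --namespace lingo --create-namespace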
Deploy an embedding model (runs on CPUs).
helm upgrade --install stapi-minilm-l6-v2 substratusai/stapi -f - << EOF
model: all-MiniLM-L6-v2
replicaCount: 0
deploymentAnnotations:
  lingo.substratus.ai/models: text-embedding-ada-002
EOF
Deploy the Mistral 7B Instruct LLM using vLLM (GPUs are required).
helm upgrade --install mistral-7b-instruct substratusai/vllm -f - << EOF
model: mistralai/Mistral-7B-Instruct-v0.1
replicaCount: 0
env:
  - name: SERVED_MODEL_NAME
    value: mistral-7b-instruct-v0.1 # needs to be same as lingo model name
deploymentAnnotations:
  lingo.substratus.ai/models: mistral-7b-instruct-v0.1
  lingo.substratus.ai/min-replicas: "0" # needs to be string
  lingo.substratus.ai/max-replicas: "3" # needs to be string
EOF
All model Deployments currently have 0 replicas. Lingo scales the matching Deployment up from zero when it receives the first HTTP request for one of its models.
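You can verify this before sending any traffic; the model Deployments (named after the Helm releases above) should report 0/0 ready replicas.
kubectl get deployments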
By default, the proxy is only accessible within the Kubernetes cluster. To access it from your local machine, set up a port forward.
kubectl port-forward svc/lingo 8080:80
In a separate terminal, watch the Pods.
watch kubectl get pods
Get embeddings using the OpenAI-compatible HTTP API.
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Lingo rocks!",
"model": "text-embedding-ada-002"
}'
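If you have jq installed, you can extract the vector from the OpenAI-style response to confirm its shape; for all-MiniLM-L6-v2 the embedding length should be 384.
curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Lingo rocks!", "model": "text-embedding-ada-002"}' \
  | jq '.data[0].embedding | length'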
You should see a model Pod being created on the fly that will serve the request. The first request will wait for this Pod to become ready.
If you deployed the Mistral 7B LLM, try sending it a request as well.
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "mistral-7b-instruct-v0.1", "prompt": "<s>[INST]Who was the first president of the United States?[/INST]", "max_tokens": 40}'
The first request to an LLM takes longer because of the size of the model. Subsequent requests should be much quicker.
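Because the proxy speaks the OpenAI API, existing OpenAI SDK clients can point at it without code changes, typically by overriding the base URL. Recent versions of the official OpenAI Python SDK, for example, read these environment variables (the API key value is a placeholder; this quickstart does not configure authentication):
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed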
Check out substratus.ai to learn more about the managed hybrid-SaaS offering. Substratus lets you run Lingo in your own cloud account while benefiting from cluster performance add-ons that can dramatically reduce startup times and boost throughput.
Let us know about features you are interested in seeing or reach out with questions. Visit our Discord channel to join the discussion!
Or just reach out on LinkedIn if you want to connect: