Description
Based on https://github.com/NVIDIA-AI-Blueprints/llm-router, where the request flow is described here: https://github.com/smarunich/llm-router/blob/main/llm-router-request-flow.md
How can Step 2 be achieved? (https://github.com/smarunich/llm-router/blob/main/llm-router-request-flow.md#2-router-controller-to-router-server-triton)
I.e., should a transformation plugin be used with Envoy AI Gateway, or what is the recommended framework for creating a custom transformation plugin for the following?
- Client sends OpenAI-compatible request to the gateway
- The gateway needs to transform this request into a format accepted by Triton Inference Server
- Specifically, we need to extract the last user message from the request and format it as:
```json
{
  "inputs": [
    {
      "name": "INPUT",
      "datatype": "BYTES",
      "shape": [1, 1],
      "data": [["User message content"]]
    }
  ]
}
```
Or should the EPP (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/README.md) or the InferencePool chart (https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/config/charts/inferencepool) be used together with Envoy AI Gateway to accomplish this?
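For reference, and independent of which mechanism ends up being the right one, the transformation itself is roughly the following. This is only a minimal plain-Python sketch of the mapping from an OpenAI chat-completions body to the Triton payload shown above; the function name and error handling are mine, not part of the blueprint or any gateway API:

```python
import json

def openai_to_triton(openai_body: bytes) -> bytes:
    """Extract the last "user" message from an OpenAI chat-completions
    request body and wrap it in the Triton payload shown above."""
    request = json.loads(openai_body)

    # Walk the messages list in reverse to find the most recent user turn.
    last_user = next(
        (m["content"] for m in reversed(request.get("messages", []))
         if m.get("role") == "user"),
        None,
    )
    if last_user is None:
        raise ValueError("request contains no user message")

    # Shape [1, 1] with nested data matches the target format above.
    triton_payload = {
        "inputs": [
            {
                "name": "INPUT",
                "datatype": "BYTES",
                "shape": [1, 1],
                "data": [[last_user]],
            }
        ]
    }
    return json.dumps(triton_payload).encode("utf-8")
```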
I am trying to understand the functionality of Envoy AI Gateway and how it can be consumed; if this capability already exists, more concrete examples would help.
The https://github.com/NVIDIA-AI-Blueprints/llm-router project uses a single Triton server, so a test example will probably be simpler than a production setup.
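For a single-server test, something like the following is what I would expect the end-to-end path to look like, bypassing the gateway entirely. It assumes Triton's standard KServe-v2 HTTP inference endpoint on the default port; the host and model name are placeholders, not values from the blueprint, and it reuses the `openai_to_triton` sketch above:

```python
import json
import urllib.request

# Placeholder values for a local test -- the real host/port and model
# name depend on the llm-router deployment.
TRITON_URL = "http://localhost:8000"
MODEL_NAME = "router-model"

# An example OpenAI-compatible request as the client would send it.
openai_request = json.dumps({
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "User message content"},
    ],
}).encode("utf-8")

# Reuse the transformation sketch from above.
triton_body = openai_to_triton(openai_request)

# POST to Triton's standard KServe-v2 HTTP inference endpoint.
req = urllib.request.Request(
    f"{TRITON_URL}/v2/models/{MODEL_NAME}/infer",
    data=triton_body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```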