Description
What happened + What you expected to happen
When calling an LLM served by Ray LLM serving, max_completion_tokens isn't honored at all, while max_tokens works. max_completion_tokens does seem to work if I run serving with vLLM directly. Not a blocker at all, but it could be beneficial to fix or at least document. For example:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "casperhansen/llama-3.3-70b-instruct-awq",
"messages": [
{"role": "user", "content": "hello, are you good at coding?"}
],
"max_completion_tokens": 10
}'
{"id":"casperhansen/llama-3.3-70b-instruct-awq-32641071-55cf-4f0c-809a-cb0cf06c11d4","object":"chat.completion","created":1750260587,"model":"casperhansen/llama-3.3-70b-instruct-awq","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello! I have been trained on a vast amount of code in various programming languages, including but not limited to Python, Java, JavaScript, C++, and many more. I can assist with:\n\n1. **Writing code**: I can help you write code from scratch or complete partially written code.\n2. **Debugging**: If you have code that's not working as expected, I can try to identify the issues and provide corrections.\n3. **Optimization**: I can suggest improvements to your code to make it more efficient, readable, or maintainable.\n4. **Explaining concepts**: If you're struggling to understand a particular programming concept, I can try to explain it in simpler terms.\n\nKeep in mind that I'm a large language model, I don't have personal experiences or direct coding experiences like a human developer would. However, I can provide information and guidance based on my training data.\n\nWhat kind of coding help do you need? Do you have a specific project or problem you'd like to work on?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":44,"total_tokens":250,"completion_tokens":206,"prompt_tokens_details":null},"prompt_logprobs":null}%
vs.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "casperhansen/llama-3.3-70b-instruct-awq",
"messages": [
{"role": "user", "content": "hello, are you good at coding?"}
],
"max_tokens": 10
}'
{"id":"casperhansen/llama-3.3-70b-instruct-awq-bde387a5-a8dd-4db3-8e06-8dfc071e3d5d","object":"chat.completion","created":1750260595,"model":"casperhansen/llama-3.3-70b-instruct-awq","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello. I have been trained on a wide range","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":44,"total_tokens":54,"completion_tokens":10,"prompt_tokens_details":null},"prompt_logprobs":null}%
It looks like the vLLM serving engine just does a direct translation from max_completion_tokens into the sampling params: https://github.com/vllm-project/vllm/blob/v0.8.5.post1/vllm/entrypoints/openai/protocol.py#L461
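Until this is addressed, a simple client-side workaround is to map max_completion_tokens onto max_tokens before sending the request, since max_tokens is honored. A minimal sketch with the OpenAI Python client (the base URL, model name, and dummy API key mirror the curl examples above; the chat helper is hypothetical, not part of any library):

```python
# Workaround sketch: translate max_completion_tokens -> max_tokens on the client side.
# Assumes the `openai` package and the Ray Serve endpoint/model from the curl examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")  # placeholder key

def chat(messages, max_completion_tokens=None, **kwargs):
    # Ray LLM serving currently ignores max_completion_tokens, so forward the
    # limit through max_tokens, which is respected.
    if max_completion_tokens is not None:
        kwargs.setdefault("max_tokens", max_completion_tokens)
    return client.chat.completions.create(
        model="casperhansen/llama-3.3-70b-instruct-awq",
        messages=messages,
        **kwargs,
    )

resp = chat(
    [{"role": "user", "content": "hello, are you good at coding?"}],
    max_completion_tokens=10,
)
print(resp.usage.completion_tokens, resp.choices[0].finish_reason)  # expect 10, "length"
```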
Versions / Dependencies
ray==2.46.0
vllm==0.8.5.post1
Reproduction script
See the curl commands attached above; a simple HTTP request with max_completion_tokens reproduces the issue.
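For convenience, the same comparison as an executable Python sketch (assumes the openai package at a version recent enough to accept max_completion_tokens, plus the endpoint and model above):

```python
# Sketch of the reproduction via the OpenAI Python client instead of curl.
# Assumes the Ray Serve endpoint and model from the examples above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
messages = [{"role": "user", "content": "hello, are you good at coding?"}]

for param in ("max_completion_tokens", "max_tokens"):
    resp = client.chat.completions.create(
        model="casperhansen/llama-3.3-70b-instruct-awq",
        messages=messages,
        **{param: 10},
    )
    # Expected if the bug is present: max_tokens caps the reply at 10 tokens
    # (finish_reason "length"), while max_completion_tokens is silently ignored.
    print(param, resp.usage.completion_tokens, resp.choices[0].finish_reason)
```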
Issue Severity
Low: It annoys or frustrates me.