[serve.llm] Ray LLM serving not respecting max_completion_tokens parameter · Issue #53922 · ray-project/ray
Open
@kanwang

Description


What happened + What you expected to happen

When calling an LLM served by Ray LLM serving, the max_completion_tokens parameter has no effect; max_tokens works as expected. max_completion_tokens does seem to work if I run serving with vLLM directly. Not a blocker at all, but it would be beneficial to fix or at least document.

e.g.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "casperhansen/llama-3.3-70b-instruct-awq",
    "messages": [
      {"role": "user", "content": "hello, are you good at coding?"}
    ],
    "max_completion_tokens": 10
  }'

{"id":"casperhansen/llama-3.3-70b-instruct-awq-32641071-55cf-4f0c-809a-cb0cf06c11d4","object":"chat.completion","created":1750260587,"model":"casperhansen/llama-3.3-70b-instruct-awq","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello! I have been trained on a vast amount of code in various programming languages, including but not limited to Python, Java, JavaScript, C++, and many more. I can assist with:\n\n1. **Writing code**: I can help you write code from scratch or complete partially written code.\n2. **Debugging**: If you have code that's not working as expected, I can try to identify the issues and provide corrections.\n3. **Optimization**: I can suggest improvements to your code to make it more efficient, readable, or maintainable.\n4. **Explaining concepts**: If you're struggling to understand a particular programming concept, I can try to explain it in simpler terms.\n\nKeep in mind that I'm a large language model, I don't have personal experiences or direct coding experiences like a human developer would. However, I can provide information and guidance based on my training data.\n\nWhat kind of coding help do you need? Do you have a specific project or problem you'd like to work on?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":44,"total_tokens":250,"completion_tokens":206,"prompt_tokens_details":null},"prompt_logprobs":null}%

vs.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "casperhansen/llama-3.3-70b-instruct-awq",
    "messages": [
      {"role": "user", "content": "hello, are you good at coding?"}
    ],
    "max_tokens": 10
  }'

{"id":"casperhansen/llama-3.3-70b-instruct-awq-bde387a5-a8dd-4db3-8e06-8dfc071e3d5d","object":"chat.completion","created":1750260595,"model":"casperhansen/llama-3.3-70b-instruct-awq","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello. I have been trained on a wide range","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":44,"total_tokens":54,"completion_tokens":10,"prompt_tokens_details":null},"prompt_logprobs":null}%

Looks like the vLLM serving engine just does a direct translation from max_completion_tokens into the sampling params: https://github.com/vllm-project/vllm/blob/v0.8.5.post1/vllm/entrypoints/openai/protocol.py#L461
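
As an illustration only (not Ray's or vLLM's actual code; the helper name is made up), the fix would amount to coalescing the two fields before the request reaches the engine, roughly along these lines:

from typing import Any, Dict


def coalesce_max_tokens(request_body: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the OpenAI max_completion_tokens field onto max_tokens.

    If only max_completion_tokens is set, mirror it into max_tokens so a
    backend that only reads max_tokens still applies the limit.
    """
    body = dict(request_body)  # avoid mutating the caller's dict
    if body.get("max_tokens") is None and body.get("max_completion_tokens") is not None:
        body["max_tokens"] = body["max_completion_tokens"]
    return body


request = {
    "model": "casperhansen/llama-3.3-70b-instruct-awq",
    "messages": [{"role": "user", "content": "hello, are you good at coding?"}],
    "max_completion_tokens": 10,
}
print(coalesce_max_tokens(request)["max_tokens"])  # -> 10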

Versions / Dependencies

ray==2.46.0
vllm==0.8.5.post1

Reproduction script

See the curl commands above; a simple HTTP request that sets max_completion_tokens reproduces the issue. An equivalent Python client reproduction is sketched below.
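
For convenience, a reproduction sketch using the OpenAI Python client instead of curl (assumes a recent openai package that supports max_completion_tokens, and the same Ray Serve LLM app running on localhost:8000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

resp = client.chat.completions.create(
    model="casperhansen/llama-3.3-70b-instruct-awq",
    messages=[{"role": "user", "content": "hello, are you good at coding?"}],
    max_completion_tokens=10,
)

# Expected: completion_tokens <= 10 and finish_reason == "length".
# Observed (ray 2.46.0): the limit is ignored and the full reply comes back.
print(resp.usage.completion_tokens, resp.choices[0].finish_reason)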

Issue Severity

Low: It annoys or frustrates me.

Metadata

Assignees

No one assigned

    Labels

    bug: Something that is supposed to be working; but isn't
    llm
    serve: Ray Serve Related Issue
    usability

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
