8000 (serve.llm) Remove test leakage from placement bundle logic by nrghosh · Pull Request #53723 · ray-project/ray · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

(serve.llm) Remove test leakage from placement bundle logic #53723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 27, 2025

Conversation

nrghosh
Copy link
Contributor
@nrghosh nrghosh commented Jun 10, 2025

Fixing Test Code Leakage in LLM Placement Bundle Logic

Problem

Test Code Leakage: Production code had test-specific logic that didn't follow clean separation of concerns:

  • Tests specified accelerator_type: A10G in YAML configs, causing placement_bundles to return [{"GPU": 1, "accelerator_type:A10G": 0.001}]
  • In CI environments without actual GPU hardware, Ray couldn't schedule these resources
  • Workaround mock_resource: 0 check was embedded in production code to bypass this and set bundles = []
  • Test infrastructure was coupled to production logic, which is not ideal

Solution

Clean Test/Production Separation: Implemented proper test fixtures that isolate test behavior without modifying production code:

  1. Test Fixture Approach (conftest.py):

    • Created disable_placement_bundles fixture using unittest.mock.patch
    • Patches VLLMEngineConfig.placement_bundles property to return [] during tests
    • Automatically applied to test fixtures that need GPU-free execution
  2. Subprocess Test Handling:

    • Initial monkey-patch only worked for in-process pytest tests
    • Added explicit CPU resource configuration for subprocess tests (serve run)
    • Used resources_per_bundle: {"CPU": 1} for CPU-only testing scenarios
  3. Production Code Cleanup:

    • Removed mock_resource checks from server_models.py
    • Eliminated test-specific environment variables that leaked into production
    • Restored clean placement bundle logic for production use

Test Coverage

Fixed Test Categories:

  • CPU tests using mock engines (no longer hang waiting for GPU resources)
  • Subprocess tests using serve run (proper resource declarations)
  • GPU integration tests with mock engines (clean separation of mock vs real engines)

Key Tests Fixed:

  • test_build_openai_app_with_config (CPU subprocess test)
  • test_lora_unavailable_base_model (CPU mock test)
  • test_models (GPU integration test with mock engine)

Architecture

Before (Test Leakage):

# Production code contained test logic
if "mock_resource" in config:
    return []  # Test-specific behavior in prod!

After (Clean Separation):

# Test fixture handles test-specific behavior
@pytest.fixture
def disable_placement_bundles():
    with patch.object(VLLMEngineConfig, "placement_bundles", 
                     new_callable=lambda: property(lambda self: [])):
        yield

Approach Evolution

Earlier broad monkey-patching broke existing tests, which led to a more refined approach:

  • Use targeted resource declarations (resources_per_bundle: {"CPU": 1}) for CPU-only scenarios
  • Preserve all core placement bundle functionality for production
  • Keep backward compatibility

@nrghosh nrghosh requested a review from a team as a code owner June 10, 2025 22:44
@nrghosh nrghosh added the go add ONLY when ready to merge, run all tests label Jun 10, 2025
@nrghosh nrghosh changed the title (serve.llm) remove test leakage from placement bundle logic (serve.llm) Remove test leakage from placement bundle logic Jun 10, 2025
@nrghosh nrghosh requested a review from kouroshHakha June 10, 2025 23:53
@nrghosh nrghosh force-pushed the fix-test_code_leakage branch 2 times, most recently from 8ee9b19 to b20ea10 Compare June 11, 2025 17:55
@nrghosh nrghosh requested a review from eicherseiji June 11, 2025 18:34
- Replace hacky fix to bypass placement groups when mock_resource was
  present
- Remove coupling between test infra and prod code
- Remove mock_resource checks, add test fixture to configure
  placement_bundles as [] when in testing
- Tests now run without requiring GPUs by using a test-layer patch
  instead of modifying production logic
- Addresses TODO (server_models.py)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh force-pushed the fix-test_code_leakage branch from b20ea10 to 78a9359 Compare June 11, 2025 20:50
@nrghosh nrghosh marked this pull request as draft June 11, 2025 21:04
@nrghosh nrghosh removed the request for review from kouroshHakha June 11, 2025 21:05
- Root Cause: Patch in conftest.py only worked within the pytest process, but the failing test (test_build_openai_app_with_config) spawned a subprocess using serve run, which didn't inherit the monkey patch
- Monkey patch for in-process tests: Continues to work for regular pytest tests
- Environment variable for subprocesses: RAYLLM_TEST_DISABLE_PLACEMENT_GROUPS=1 is set by the fixture and checked
by the production code

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh removed the go add ONLY when ready to merge, run all tests label Jun 14, 2025
- stop using env variable to control placement groups
- monkey patching (global) broke tests by overriding all
- tests ensure placement bundle logic works
- eliminate leakage while still maintaining test coverage
- fix tests by relaxing some assertions on serve_options dict

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh marked this pull request as ready for review June 18, 2025 19:58
nrghosh added 2 commits June 18, 2025 16:55
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label Jun 19, 2025
Copy link
Contributor
@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. waiting for the tests.

@kouroshHakha kouroshHakha enabled auto-merge (squash) June 19, 2025 18:47
…resource needs. apply placement bundle override to overlooked tests, fix type errors

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@github-actions github-actions bot disabled auto-merge June 20, 2025 01:19
nrghosh added 2 commits June 23, 2025 12:55
The refactoring to explicit placement group mocking missed some fixtures
that use mock engines but still tried to create real placement groups.

Changes:
- Add disable_placement_bundles dependency to testing_model fixtures in conftest.py
- Set accelerator_type: None in test_application_builders.py subprocess configs
- Remove environment variable approach to avoid test code leakage

This fixes hanging tests in both CPU and GPU test suites that use mock
engines (MockVLLMEngine) but were attempting to create real GPU placement
groups. Mock vs real engine tests need different placement
 group handling, not CPU vs GPU tests.

Fixes:
- test_build_openai_app_with_config (CPU subprocess test)
- test_lora_unavailable_base_model (CPU mock test)
- test_models (GPU integration test with mock engine)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…t groups

- fix broad initial approach that was breaking some tests
- more targeted approach (explicitly declare test resources in config)
- Revert core placement_bundles logic to preserve production behavior
- Add resources_per_bundle: {"CPU": 1} to test fixture for CPU-only testing
- Fixes test_build_openai_app_with_config deployment timeouts without breaking existing tests

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh force-pushed the fix-test_code_leakage branch from b032aef to 2f67dc1 Compare June 24, 2025 01:41
nrghosh and others added 3 commits June 23, 2025 23:50
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>


@pytest.fixture
def disable_placement_bundles():
Copy link
Contributor
@eicherseiji eicherseiji Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in a future change--

To avoid adding disable_placement_bundles each to each test, could we instead

  • create a new conftest.py under cpu
  • move this fixture declaration there
  • and use @pytest.fixture(autouse=True)?

@eicherseiji eicherseiji self-requested a review June 26, 2025 22:51
@kouroshHakha kouroshHakha merged commit 2140faa into ray-project:master Jun 27, 2025
5 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…ect#53723)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0