[Dashboard] Support for List Tasks Filter Pushdown #53970
Open
@yancanmao

Description


Problem

When submitting jobs with a large number of tasks (e.g., 10,000+), the Ray Dashboard's list_tasks-related endpoints, i.e., /api/v0/tasks (used in the Task Table component) and /api/v0/tasks/summarize (used in the Ray Core Overview component) on each job page, become very slow (>5s) or even unusable due to:

  • Backend CPU overload from full task deserialization and in-memory filtering.
  • The gRPC message size limit (512 MB) being exceeded when the task event data volume is too large.

Analysis

The root cause is that when the list_tasks API is called, the GCS server returns all task event records to the Dashboard backend, which must then fully deserialize, filter, and render them, regardless of user intent. This does not scale for jobs with tens of thousands of tasks.
While the GCS-side API supports filtering, the current frontend does not pass any filter parameters, and GCS itself does not support filtering by task scheduling state, a field that is critical for components such as Ray Core Overview (e.g., the Hide Finished checkbox) and Task Table (e.g., task state filters).

Proposed Solution

We therefore propose a Filter Pushdown solution, where all Ray Dashboard list_tasks-related filters (e.g., task state) are passed down to GCS when querying task events, so that the Ray Dashboard remains usable even in extreme-scale production workloads. According to our internal experiments, this list_tasks filter pushdown significantly reduces frontend response time when filters are applied. It also avoids problems caused by oversized gRPC messages.

Concretely, we propose enabling filter pushdown for the list_tasks-related components in the Ray Dashboard to improve scalability, reduce load, and ensure responsiveness when visualizing large jobs. The proposal involves two parts:

Add Frontend Support for Filter Parameters

In the current Dashboard (as of the latest Ray release), the frontend does not pass any filters to the backend; all filtering is done in memory after fetching the full task list.
We propose modifying the frontend to encode user-specified filters (e.g., task state, name) into the HTTP request URL, as in the two examples below; a Python sketch of issuing such a request follows them.
Example 1: Modified getTasks() used in Task Table

/api/v0/tasks?detail=1&limit=10000
  &filter_keys=state&filter_predicates=%3D&filter_values=FAILED # add state filter

Example 2: Modified getStateApiJobProgressByLineage() in Ray Core Overview

/api/v0/tasks/summarize?summary_by=lineage
  &filter_keys=job_id&filter_predicates=%3D&filter_values=01000000
  &filter_keys=state&filter_predicates=!%3D&filter_values=FINISHED # add state filter
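
For illustration, the same kind of filtered request can also be constructed and issued outside the browser. The following is a minimal Python sketch (not part of the proposal itself); the dashboard address is assumed to be the default http://127.0.0.1:8265, and the query parameters simply mirror the filter syntax shown in the examples above:

# Minimal sketch: build and issue a filtered list_tasks request against the
# dashboard HTTP endpoint. The address and filter values are placeholders.
from urllib.parse import urlencode

import requests

DASHBOARD = "http://127.0.0.1:8265"  # assumed default dashboard address

params = [
    ("detail", 1),
    ("limit", 10000),
    # Each filter is a (key, predicate, value) triple, split across the three
    # repeated query parameters used in the examples above.
    ("filter_keys", "state"),
    ("filter_predicates", "="),
    ("filter_values", "FAILED"),
]

url = f"{DASHBOARD}/api/v0/tasks?{urlencode(params)}"  # "=" is encoded as %3D
resp = requests.get(url, timeout=30)
print(resp.status_code, len(resp.content), "bytes")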

Backend + GCS Support for Filter Fields

  1. Extend the proto fields to support filtering by task scheduling state.
  2. Add state filter logic to the GCS Task Manager (the intended filter semantics are sketched below).
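
The actual filter logic would be implemented in the GCS Task Manager (C++); the snippet below is only a sketch of the intended semantics, written in Python with invented names for illustration: a pushed-down state predicate is evaluated per task event so that non-matching events are never serialized into the reply.

# Illustrative sketch only: the intended state-filter semantics, expressed in
# Python. In the real implementation this check would run inside the GCS Task
# Manager while assembling the task-events reply.

def state_matches(task_state, predicate, value):
    """Return True if a task's scheduling state satisfies one pushed-down filter."""
    if predicate == "=":
        return task_state == value
    if predicate == "!=":
        return task_state != value
    raise ValueError(f"Unsupported predicate: {predicate}")

def apply_state_filters(task_events, filters):
    """Keep only task events whose state passes every (key, predicate, value) state filter."""
    return [
        event
        for event in task_events
        if all(
            state_matches(event["state"], pred, val)
            for key, pred, val in filters
            if key == "state"
        )
    ]

# Example: drop finished tasks before they ever reach the dashboard backend.
events = [{"task_id": "a", "state": "FINISHED"}, {"task_id": "b", "state": "RUNNING"}]
print(apply_state_filters(events, [("state", "!=", "FINISHED")]))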

Use case

We aim to support a better Ray Dashboard experience for large-scale jobs (e.g., those with 10,000+ tasks).

For example, running the following code submits 10,000 tasks to a Ray cluster:

import ray

ray.init()

@ray.remote
def dummy_task(x):
    return x * x

# Submit 10,000 tasks for the dashboard test.
ray.get([dummy_task.remote(i) for i in range(10000)])

Then, we observe the following two issues:

Navigate to Dashboard → Jobs → Select the job → Expand “Ray Core Overview”
→ triggers /api/v0/tasks/summarize
→ Observed delay: ~2.5 seconds

Navigate to Dashboard → Jobs → Select the job → Expand “Task Table”
→ triggers /api/v0/tasks
→ Observed delay: ~5 seconds

Profiling reveals that ~80% of the backend time is spent deserializing the full task event list.
In our production environments, this overhead can be even higher, especially when each task carries more metadata (e.g., user-specified runtime environments). In such cases, the excessive data volume has also led to gRPC message size limit errors.
This indicates that transferring and processing all task entries in the Dashboard server causes major performance bottlenecks.

By supporting filter pushdown to GCS (e.g., filtering by task state or name), we can significantly reduce unnecessary data transfer and processing, making the dashboard far more responsive and scalable.
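
To reproduce the comparison, a timing sketch like the one below can be used (the address, limit, and filter values are assumptions, and the filtered call only benefits from pushdown once this proposal is implemented):

# Minimal sketch: compare response time and payload size of the list_tasks
# endpoint with and without a state filter. Assumes the default dashboard
# address and that the 10,000-task job above has been submitted.
import time

import requests

DASHBOARD = "http://127.0.0.1:8265"

def timed_get(path):
    start = time.perf_counter()
    resp = requests.get(f"{DASHBOARD}{path}", timeout=60)
    elapsed = time.perf_counter() - start
    print(f"{path}\n  -> {resp.status_code}, {len(resp.content)} bytes, {elapsed:.2f}s")

# Unfiltered: the backend fetches and deserializes every task event.
timed_get("/api/v0/tasks?detail=1&limit=10000")

# Filtered: with pushdown, GCS would return only non-finished task events.
timed_get("/api/v0/tasks?detail=1&limit=10000"
          "&filter_keys=state&filter_predicates=!%3D&filter_values=FINISHED")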

Expected Benefits

  • Significantly reduce gRPC payload size and backend processing overhead for list_tasks calls that use filters
  • Ensure dashboard responsiveness and observability for large jobs
  • Prevent crashes caused by exceeding the gRPC message size limit

Metadata

    Labels

    P1: Issue that should be fixed within a few weeks
    core: Issues that should be addressed in Ray Core
    dashboard: Issues specific to the Ray Dashboard
    enhancement: Request for new feature and/or capability
    performance
    usability
