[Core][State Observability] Improve stability at large scale cluster. · Issue #25718 · ray-project/ray · GitHub
Open · opened by @rkooo567

Description


Currently, the state APIs query all relevant sources and return all of the data they collect. However, when there is a large amount of data, this can put unusually heavy pressure and load on both the sources and the API server.

To keep large-scale clusters stable, we will bound the output size of the API. Concretely, there will be four rules:

  • O(1) overhead per node per call (e.g., a limit of 1,000 records).
  • O(n) overhead on the API server per call (n: the number of nodes).
  • O(1) final result size (e.g., a limit of 10,000 records).
  • The API server limits the number of concurrent requests.

Note that this approach loses data whenever a limit is reached, and users must be informed when that happens. In the long term, we can add a pagination API to the sources so that all data can be obtained without loss.
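Both points, warning on loss and paginating to avoid it, can be sketched under assumptions that are not in the issue: a hypothetical `InMemorySource.list_page(after, limit)` returns at most `limit` records whose unique, sortable keys come after `after`, plus a cursor (the last key returned, or `None` when nothing is left). The hypothetical `list_bounded` makes one bounded call and warns if records were left out; `iter_all` pages through everything while keeping each individual call bounded.

```python
import bisect
import warnings
from typing import Iterator, List, Optional, Tuple


class InMemorySource:
    """Hypothetical per-node source; keys must be unique and sortable."""

    def __init__(self, keys: List[str]):
        self._keys = sorted(keys)

    def list_page(self, after: Optional[str], limit: int) -> Tuple[List[str], Optional[str]]:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        # Start just past the last key the caller has already seen.
        start = 0 if after is None else bisect.bisect_right(self._keys, after)
        page = self._keys[start:start + limit]
        has_more = start + len(page) < len(self._keys)
        # The cursor is the last key returned, or None when nothing is left.
        return page, (page[-1] if has_more else None)


def list_bounded(source: InMemorySource, limit: int) -> List[str]:
    """One bounded call that tells the user if records were left out."""
    page, cursor = source.list_page(None, limit)
    if cursor is not None:
        warnings.warn(f"Only the first {limit} records were returned; more exist.")
    return page


def iter_all(source: InMemorySource, page_size: int) -> Iterator[str]:
    """Yields every record, fetching one bounded page per call."""
    cursor: Optional[str] = None
    while True:
        page, cursor = source.list_page(cursor, page_size)
        yield from page
        if cursor is None:
            return


if __name__ == "__main__":
    src = InMemorySource([f"task-{i:05d}" for i in range(2_500)])
    print(len(list_bounded(src, 1_000)))          # 1000, plus a UserWarning
    print(sum(1 for _ in iter_all(src, 1_000)))   # 2500
```

A key-based cursor (the last key seen) is used rather than a numeric offset so that records inserted or removed between calls do not shift later pages; the trade-off is that keys must be unique.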

Use case

No response

Metadata

Labels

  • P1 (Issue that should be fixed within a few weeks)
  • core (Issues that should be addressed in Ray Core)
  • core-ux
  • enhancement (Request for new feature and/or capability)
