Open
Description
With Ray starting to support the virtual cluster (vCluster) concept, and with advanced multi-cluster-per-user setups emerging, the Ray dashboard components should no longer be bound to a single Ray cluster's lifetime, since that binding makes multi-tenant sharing and telemetry data persistence complex to implement. In addition, the dashboard goes down together with the head node (fate-sharing), which makes it difficult to trace what happened (and what was executing) during a major incident. @liuxsh9 @Bye-legumes @nemo9cby
Use case
Decoupling the dashboard would bring the following benefits:
- The dashboard can optionally read from a persistent history server (observability database) instead of pulling directly from a running GCS (GCS/HA Redis writes to the persistent store).
- Dashboard-side overhead can no longer accidentally bring down the head node.
- Users can attach their own external monitoring platforms, the same way as with the job dashboard, to manage a large number of clusters.
- Each user gets their own dashboard, which can span multiple physical clusters or vClusters.
- Users can check the dashboard even after a cluster has been preempted or shut down.
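
To make the read path concrete, here is a minimal sketch of how a decoupled dashboard might fall back from a live cluster to a persistent history store. None of these names are existing Ray APIs: `ClusterStateSource`, `LiveGcsSource`, `HistoryStoreSource`, and `DecoupledDashboard` are hypothetical, and the in-memory dictionaries stand in for GCS and the observability database.

```python
from typing import Dict, List, Protocol


class ClusterStateSource(Protocol):
    """Hypothetical interface the dashboard reads from (not a Ray API)."""

    def list_jobs(self, cluster_id: str) -> List[dict]: ...


class LiveGcsSource:
    """Stand-in for a running GCS; fails once the cluster is gone."""

    def __init__(self, live_state: Dict[str, List[dict]]) -> None:
        self._live_state = live_state

    def list_jobs(self, cluster_id: str) -> List[dict]:
        if cluster_id not in self._live_state:
            # Simulates a head node that was preempted or shut down.
            raise ConnectionError(f"cluster {cluster_id!r} is not reachable")
        return list(self._live_state[cluster_id])


class HistoryStoreSource:
    """Stand-in for the persistent history server (observability database)."""

    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = {}

    def record(self, cluster_id: str, job: dict) -> None:
        # In the proposal, GCS/HA Redis would write here; this is an assumption.
        self._records.setdefault(cluster_id, []).append(job)

    def list_jobs(self, cluster_id: str) -> List[dict]:
        return list(self._records.get(cluster_id, []))


class DecoupledDashboard:
    """Dashboard whose lifetime is independent of any single cluster."""

    def __init__(self, live: ClusterStateSource, history: ClusterStateSource) -> None:
        self._live = live
        self._history = history

    def list_jobs(self, cluster_id: str) -> List[dict]:
        try:
            return self._live.list_jobs(cluster_id)
        except ConnectionError:
            # Cluster is down: serve whatever was persisted before it went away.
            return self._history.list_jobs(cluster_id)


if __name__ == "__main__":
    history = HistoryStoreSource()
    history.record("vcluster-a", {"job_id": "01", "status": "SUCCEEDED"})
    live = LiveGcsSource({"vcluster-b": [{"job_id": "02", "status": "RUNNING"}]})
    dashboard = DecoupledDashboard(live, history)
    print(dashboard.list_jobs("vcluster-b"))  # served from the live cluster
    print(dashboard.list_jobs("vcluster-a"))  # served from the history store
```

Because the dashboard only depends on the source interface, the same process could serve several physical clusters or vClusters per user, and an external monitoring platform could plug in as just another source.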