[CLI] make ray get-head-ip and ray get-worker-ips work for kuberay clusters when run outside the cluster #32037
Open
@davidxia

Description

I want to be able to run ray get-head-ip cluster-config.yaml and ray get-worker-ips cluster-config.yaml for a RayCluster created with kuberay.

Currently these commands fail for kuberay because they assume they're running inside the Kubernetes (GKE) cluster that hosts the RayCluster, with in-cluster service account credentials. It'd be great if these commands also worked when run outside the cluster.

example cluster config

cat cluster-config.yaml

cluster_name: dxia-test
provider:
    type: kuberay
    namespace: hyperkube
    worker_liveness_check: False
    worker_rpc_drain: True
    disable_node_updaters: True
    disable_launch_config_check: True
    foreground_node_launch: True
    use_internal_ips: True
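
For reference, the RayCluster object the CLI tries to read is already reachable from outside the cluster with ordinary kubeconfig credentials, for example (assuming kubectl is pointed at the GKE cluster and the Ray CRDs are installed; the namespace and name come from the config above):

kubectl -n hyperkube get rayclusters.ray.io dxia-test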

current behavior

❯ ray get-head-ip cluster-config.yaml
2023-01-29 12:32:59,109	INFO node_provider.py:211 -- Creating KuberayNodeProvider.
Traceback (most recent call last):
  File "/Users/dxia/.pyenv/versions/hray/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1665, in get_head_ip
    click.echo(get_head_node_ip(cluster_config_file, cluster_name))
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1295, in get_head_node_ip
    provider = _get_node_provider(config["provider"], config["cluster_name"])
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/providers.py", line 229, in _get_node_provider
    new_provider = provider_cls(provider_config, cluster_name)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 215, in __init__
    self.headers, self.verify = load_k8s_secrets()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 144, in load_k8s_secrets
    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as secret:
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/secrets/kubernetes.io/serviceaccount/token'
❯ ray get-head-ip cluster-config.yaml
2023-01-29 12:35:43,105	INFO node_provider.py:211 -- Creating KuberayNodeProvider.
Traceback (most recent call last):
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/Users/dxia/.pyenv/versions/3.8.12/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/hyperkube/rayclusters/dxia-test (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dxia/.pyenv/versions/hray/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1665, in get_head_ip
    click.echo(get_head_node_ip(cluster_config_file, cluster_name))
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1296, in get_head_node_ip
    head_node = _get_running_head_node(config, config_file, override_cluster_name)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1362, in _get_running_head_node
    nodes = provider.non_terminated_nodes(head_node_tags)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/batching_node_provider.py", line 155, in non_terminated_nodes
    self.node_data_dict = self.get_node_data()
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 232, in get_node_data
    self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 427, in _get
    result = requests.get(url, headers=self.headers, verify=self.verify)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/hyperkube/rayclusters/dxia-test (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Once these lines are changed, it works:

https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L144 changed to the file path of a local service account token

https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L150 changed to the file path of a local CA cert

https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L170 changed to "https://K8S_API_HOSTNAME_OR_IP"
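
As a rough illustration of the kind of change requested, a credential loader could fall back to the current in-cluster defaults while allowing overrides. This is only a sketch; the environment variable names (RAY_KUBERAY_TOKEN_PATH, RAY_KUBERAY_CA_CERT_PATH, RAY_KUBERAY_API_SERVER) and the helper name are hypothetical, not part of Ray today.

import os

# In-cluster defaults matching what KuberayNodeProvider currently hard-codes.
DEFAULT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
DEFAULT_CA_CERT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
DEFAULT_API_SERVER = "https://kubernetes.default"


def load_k8s_credentials():
    """Return (headers, verify, api_server) for talking to the Kubernetes API.

    Hypothetical sketch: when the override environment variables are unset,
    the in-cluster service account paths are used, so behavior inside a pod
    is unchanged; outside the cluster, point the overrides at a local token,
    CA cert, and API server address.
    """
    token_path = os.environ.get("RAY_KUBERAY_TOKEN_PATH", DEFAULT_TOKEN_PATH)
    ca_cert_path = os.environ.get("RAY_KUBERAY_CA_CERT_PATH", DEFAULT_CA_CERT_PATH)
    api_server = os.environ.get("RAY_KUBERAY_API_SERVER", DEFAULT_API_SERVER)

    with open(token_path) as f:
        token = f.read().strip()

    headers = {"Authorization": f"Bearer {token}"}
    # `verify` is passed to requests.get(): a CA bundle path if one exists,
    # otherwise fall back to the system trust store.
    verify = ca_cert_path if os.path.exists(ca_cert_path) else True
    return headers, verify, api_server

With something like this in place, the in-cluster code path stays the same, and out-of-cluster users would only need to set the three overrides before running ray get-head-ip or ray get-worker-ips.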

Use case

It's convenient to be able to run these Ray CLI commands from outside the Kubernetes cluster that hosts the kuberay RayCluster. This would close the feature parity gap between kuberay clusters and other cluster types.

Labels

P2 (important, but not time-critical), core, core-autoscaler, core-clusters, enhancement, infra, kuberay, pending-cleanup
