Description
I want to be able to run ray get-head-ip cluster-config.yaml
and ray get-worker-ips cluster-config.yaml
for a RayCluster created with KubeRay.
Currently these commands fail for KubeRay because they assume they are running inside the Kubernetes (GKE, in my case) cluster that hosts the RayCluster, with the correct service account permissions. It'd be great if these commands also worked when run outside the cluster.
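As a rough illustration of what "working from outside the cluster" means here, this is how one can already resolve the head pod IP with the official kubernetes Python client and a local kubeconfig. This is not what Ray's KuberayNodeProvider does; it is only a sketch, and it assumes the ray.io/cluster and ray.io/node-type labels that KubeRay applies to the pods it creates:

```python
# Rough illustration only: look up the head pod IP of a KubeRay-managed
# RayCluster from outside the cluster, using a local kubeconfig rather
# than the in-cluster service account that Ray currently assumes.
from kubernetes import client, config

config.load_kube_config()  # read ~/.kube/config instead of in-cluster secrets
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="hyperkube",
    label_selector="ray.io/cluster=dxia-test,ray.io/node-type=head",
)
for pod in pods.items:
    print(pod.status.pod_ip)
```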
Example cluster config
cat cluster-config.yaml
cluster_name: dxia-test
provider:
    type: kuberay
    namespace: hyperkube
    worker_liveness_check: False
    worker_rpc_drain: True
    disable_node_updaters: True
    disable_launch_config_check: True
    foreground_node_launch: True
    use_internal_ips: True
Current behavior
❯ ray get-head-ip cluster-config.yaml
2023-01-29 12:32:59,109 INFO node_provider.py:211 -- Creating KuberayNodeProvider.
Traceback (most recent call last):
File "/Users/dxia/.pyenv/versions/hray/bin/ray", line 8, in <module>
sys.exit(main())
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2386, in main
return cli()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1665, in get_head_ip
click.echo(get_head_node_ip(cluster_config_file, cluster_name))
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1295, in get_head_node_ip
provider = _get_node_provider(config["provider"], config["cluster_name"])
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/providers.py", line 229, in _get_node_provider
new_provider = provider_cls(provider_config, cluster_name)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 215, in __init__
self.headers, self.verify = load_k8s_secrets()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 144, in load_k8s_secrets
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as secret:
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/secrets/kubernetes.io/serviceaccount/token'
❯ ray get-head-ip cluster-config.yaml
2023-01-29 12:35:43,105 INFO node_provider.py:211 -- Creating KuberayNodeProvider.
Traceback (most recent call last):
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/Users/dxia/.pyenv/versions/3.8.12/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
self.sock = conn = self._new_conn()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/hyperkube/rayclusters/dxia-test (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/dxia/.pyenv/versions/hray/bin/ray", line 8, in <module>
sys.exit(main())
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2386, in main
return cli()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1665, in get_head_ip
click.echo(get_head_node_ip(cluster_config_file, cluster_name))
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1296, in get_head_node_ip
head_node = _get_running_head_node(config, config_file, override_cluster_name)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1362, in _get_running_head_node
nodes = provider.non_terminated_nodes(head_node_tags)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/batching_node_provider.py", line 155, in non_terminated_nodes
self.node_data_dict = self.get_node_data()
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 232, in get_node_data
self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 427, in _get
result = requests.get(url, headers=self.headers, verify=self.verify)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/Users/dxia/.pyenv/versions/hray/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/hyperkube/rayclusters/dxia-test (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125a111f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
Once these lines are changed, the commands work from outside the cluster (see the sketch after this list):
https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L144 changed to the file path of a locally saved token
https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L150 changed to the file path of a locally saved cert
https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/autoscaler/_private/kuberay/node_provider.py#L170 changed to "https://K8S_API_HOSTNAME_OR_IP"
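For illustration, here is a minimal sketch of the kind of fallback that would make this work without patching the source. It assumes hypothetical environment variables RAY_KUBERAY_TOKEN_FILE, RAY_KUBERAY_CA_CERT_FILE, and RAY_KUBERAY_API_SERVER (none of these exist in Ray today); the defaults mirror the in-cluster values the node provider currently hard-codes:

```python
import os

# Hypothetical overrides -- these environment variables are NOT part of Ray.
# They only illustrate how the hard-coded in-cluster values could be relaxed.
TOKEN_PATH = os.environ.get(
    "RAY_KUBERAY_TOKEN_FILE",
    "/var/run/secrets/kubernetes.io/serviceaccount/token",
)
CA_CERT_PATH = os.environ.get(
    "RAY_KUBERAY_CA_CERT_FILE",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
)
API_SERVER = os.environ.get(
    "RAY_KUBERAY_API_SERVER",
    "https://kubernetes.default:443",
)


def load_k8s_secrets():
    """Build auth headers and the CA bundle path, from either the in-cluster
    service account (the defaults) or the override paths above."""
    with open(TOKEN_PATH) as secret:
        token = secret.read().strip()
    headers = {"authorization": f"Bearer {token}"}
    return headers, CA_CERT_PATH
```

With something like this, running the commands outside the cluster would only require pointing the three variables at a locally saved token, the CA cert, and the externally reachable API server address.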
Use case
It's convenient to be able to run these Ray CLI commands from outside the KubeRay-managed RayCluster. This would close the feature parity gap between KubeRay clusters and other cluster types.