Description
What happened + What you expected to happen
After running ray up and waiting for the command to finish, I can't use the cluster properly because the container is in "Exited" status (I confirmed this by SSHing directly into the VM). If I run ray exec or ray attach, I get a "No such container" error with the container name from Ray's config.

This happens on Google Cloud with the VM image "projects/deeplearning-platform-release/global/images/family/common-cpu" (this image was in the default GCP config in Ray's repo). I managed to trace the cause: this particular VM has a c2d-startup script that runs several other scripts, and one of them restarts the Docker engine after Ray has started its container. That stops Ray's container, and Ray does not restart it. The hacky workaround I used is to call ray up again immediately after the first call, which healed the cluster.

In journalctl -xe this looks something like this:
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5661]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: (CRON) info (No MTA installed, discarding output)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session closed for user root
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce-rootless-extras (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libc-dev-bin (2.28-10+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libdns-export1104 (1:9.11.5.P4+dfsg-5.1+deb10u8) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up isc-dhcp-client (4.4.1-2+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopping Docker Application Container Engine...
-- Subject: A stop job for unit docker.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A stop job for unit docker.service has begun execution.
--
-- The job identifier is 1667.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.503321085Z" level=info msg="Processing signal 'terminated'"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607816080Z" level=info msg="shim disconnected" id=4f0cd322189e45d38de56400315d112b2f42209ea
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607911268Z" level=warning msg="cleaning up after shim disconnected" id=4f0cd322189e45d38de5
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607930270Z" level=info msg="cleaning up dead shim"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.607866268Z" level=info msg="ignoring event" container=4f0cd322189e45d38de56400315d112b2f42209e
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.620542271Z" level=warning msg="cleanup warnings time=\"2022-10-25T17:09:03Z\" level=info ms
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675187400Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" m
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675463479Z" level=info msg="Daemon shutdown complete"
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: docker.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit docker.service has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopped Docker Application Container Engine.
-- Subject: A stop job for unit docker.service has finished
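To check whether a given VM hit the same sequence, the journal can be scanned for the systemd messages that appear in the excerpt above. The following is a minimal sketch, not part of Ray; the function name `find_docker_stop_events` and the marker list are assumptions taken from the log lines in this report.

```python
# Hypothetical helper: scan journalctl output for Docker engine stop events.
# The marker strings are copied from the log excerpt in this issue.
DOCKER_STOP_MARKERS = (
    "Stopping Docker Application Container Engine",
    "Stopped Docker Application Container Engine",
)


def find_docker_stop_events(journal_lines):
    """Return the journal lines that report the Docker engine being stopped."""
    events = []
    for line in journal_lines:
        if any(marker in line for marker in DOCKER_STOP_MARKERS):
            events.append(line.strip())
    return events
```

One way to use it is to save the output of journalctl -xe to a file on the VM and pass the file's lines to the function. A non-empty result after the time Ray started its container suggests the engine restart described here.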
With the VM image "projects/cos-cloud/global/images/cos-101-17162-40-16", the problem seems to disappear.
So I'm wondering whether this could be addressed in the docs, since the full example uses the common-cpu image. I'm not sure whether it would be simple to add some kind of synchronization during ray up initialization to avoid this problem consistently.
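Until something like that exists, the workaround of calling ray up a second time can be scripted. This is only a sketch of that hack under stated assumptions: the name `ray_up_twice`, the optional `settle_seconds` pause, and the injectable `run` and `sleep` parameters are all hypothetical, and the second call helps only if the Docker restart has already happened by then.

```python
import subprocess
import time


def ray_up_twice(config_path, settle_seconds=0.0,
                 run=subprocess.run, sleep=time.sleep):
    """Run `ray up` two times in a row as a workaround (hypothetical helper).

    The first call may leave the container stopped if the VM's startup
    script restarts Docker afterwards; the second call brings it back up.
    """
    # --yes skips the interactive confirmation prompt.
    command = ["ray", "up", config_path, "--yes"]
    run(command, check=True)
    if settle_seconds > 0:
        # Optional pause (an assumption, not from the report) to let the
        # VM's startup script finish before the second attempt.
        sleep(settle_seconds)
    run(command, check=True)
    return command
```

For example, `ray_up_twice("cluster.yaml")` mirrors what I did by hand, where "cluster.yaml" stands in for the actual config file.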
Versions / Dependencies
2.0.0
Reproduction script
Issue Severity
Low: It annoys or frustrates me.