[gcp] "No such container" error after ray up · Issue #29671 · ray-project/ray · GitHub
[gcp] "No such container" error after ray up #29671
Open
@Tehada

Description

What happened + What you expected to happen

After running ray up and waiting for the command to finish, I can't use the cluster properly, because the container is in "Exited" status (I confirmed this by SSHing directly into the VM) -- running ray exec or ray attach fails with a "No such container" error naming the container from Ray's config. This happens on Google Cloud with the VM image "projects/deeplearning-platform-release/global/images/family/common-cpu" (the image used in the default GCP config in Ray's repo). I managed to trace the cause: this particular VM image has a c2d-startup script that runs several other scripts, and one of them restarts the Docker engine after Ray has already started its container. That of course stops Ray's container, and Ray does not restart it. (The hacky workaround I used is to run ray up again immediately after the first call -- this healed the cluster; a rough command sketch follows the log below.) In journalctl -xe this looks something like this:

Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5661]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: (CRON) info (No MTA installed, discarding output)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session closed for user root
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce-rootless-extras (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libc-dev-bin (2.28-10+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libdns-export1104 (1:9.11.5.P4+dfsg-5.1+deb10u8) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up isc-dhcp-client (4.4.1-2+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopping Docker Application Container Engine...
-- Subject: A stop job for unit docker.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit docker.service has begun execution.
-- 
-- The job identifier is 1667.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.503321085Z" level=info msg="Processing signal 'terminated'"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607816080Z" level=info msg="shim disconnected" id=4f0cd322189e45d38de56400315d112b2f42209ea
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607911268Z" level=warning msg="cleaning up after shim disconnected" id=4f0cd322189e45d38de5
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607930270Z" level=info msg="cleaning up dead shim"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.607866268Z" level=info msg="ignoring event" container=4f0cd322189e45d38de56400315d112b2f42209e
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.620542271Z" level=warning msg="cleanup warnings time=\"2022-10-25T17:09:03Z\" level=info ms
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675187400Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" m
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675463479Z" level=info msg="Daemon shutdown complete"
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: docker.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit docker.service has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopped Docker Application Container Engine.
-- Subject: A stop job for unit docker.service has finished
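
For reference, this is roughly the sequence I used to confirm the problem and apply the hacky workaround. The config file name and the instance name are just placeholders from my setup, the gcloud ssh call may also need a --zone flag, and the container name is whatever docker.container_name is set to in the config:

# Launch the cluster with the GCP example config (common-cpu image).
ray up example-full.yaml

# ray attach / ray exec now fail with "No such container". Confirm the
# container state over a raw SSH session into the head VM:
gcloud compute ssh ray-cluster-minimal-head-4b5952a2-compute
docker ps -a          # the Ray container shows status "Exited"

# Hacky workaround: rerun ray up. By now the startup script has finished
# restarting the Docker daemon, so the container comes up and stays up.
ray up example-full.yaml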

With the VM image "projects/cos-cloud/global/images/cos-101-17162-40-16" the problem seems to disappear.
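
For anyone who wants to make the same switch, the change amounts to pointing the boot disk at the COS image in the cluster YAML. A rough sketch of the relevant node_config fragment (field names follow Ray's GCP example config; everything else is omitted):

available_node_types:
  ray_head_default:
    node_config:
      machineType: n1-standard-2
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            # Container-Optimized OS instead of the deeplearning common-cpu
            # image; its startup does not seem to restart the Docker daemon
            # underneath the Ray container.
            sourceImage: projects/cos-cloud/global/images/cos-101-17162-40-16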

So I'm wondering whether this problem could be addressed in the docs, since the full example config uses the common-cpu image. I'm not sure whether it would be simple to implement some kind of synchronization during ray up initialization to avoid this problem consistently.
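
One possible mitigation I haven't verified end to end: use initialization_commands (which run on the host VM before Ray sets up Docker) to wait until the image's package installation has finished, so the Docker restart happens before Ray starts its container. A sketch, assuming the Docker restart is driven by the apt/dpkg install seen in the log above and that fuser is available on the image:

initialization_commands:
    # Wait for the image's startup script to release the dpkg lock, i.e. for
    # the docker-ce upgrade (and the daemon restart it triggers) to finish,
    # before Ray pulls and starts its container.
    - while sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do sleep 5; done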

Versions / Dependencies

2.0.0

Reproduction script

Issue Severity

Low: It annoys or frustrates me.

Metadata

Assignees

No one assigned

    Labels

    P3: Issue moderate in impact or severity
    bug: Something that is supposed to be working; but isn't
    core: Issues that should be addressed in Ray Core
    infra: autoscaler, ray client, kuberay, related issues
    pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.

