Sometimes installation stuck, because no worker nodes added. · Issue #84 · stormshift/automation

Sometimes installation stuck, because no worker nodes added. #84

Open
rbo opened this issue Jan 7, 2025 · 6 comments

rbo commented Jan 7, 2025

After many successful provisionings, this installation never finished:

Fetched the kubeconfig via:

% ansible-navigator run scrible/download-kubeconfigs-from-vault.yaml --vault-password-file=.vault_pass -e @development-example.vars-private
% export KUBECONFIG=scrible/kubeconfig-stormshift-ocp1
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.17.0    False       False         True       120m    OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.93.181:443/healthz": dial tcp 172.30.93.181:443: connect: connection refused...
baremetal                                  4.17.0    True        False         False      118m
cloud-controller-manager                   4.17.0    True        False         False      124m
cloud-credential                           4.17.0    True        False         False      107m
cluster-autoscaler                                   True        False         True       117m    machine-api not ready
config-operator                            4.17.0    True        False         False      119m
console                                    4.17.0    False       False         True       105m    RouteHealthAvailable: console route is not admitted
control-plane-machine-set                  4.17.0    True        False         False      118m
csi-snapshot-controller                    4.17.0    True        False         False      119m
dns                                        4.17.0    True        False         False      95m
etcd                                       4.17.0    True        False         False      118m
image-registry                             4.17.0    True        False         False      105m
ingress                                              False       True          True       117m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.17.0    True        False         False      107m
kube-apiserver                             4.17.0    True        False         False      115m
kube-controller-manager                    4.17.0    True        False         True       115m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
kube-scheduler                             4.17.0    True        False         False      114m
kube-storage-version-migrator              4.17.0    True        False         False      120m
machine-api                                          False       True          True       117m    Operator is initializing
machine-approver                           4.17.0    True        False         False      118m
machine-config                             4.17.0    True        False         False      118m
marketplace                                4.17.0    True        False         False      118m
monitoring                                           False       True          True       91m     UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 2 unavailable replicas
network                                    4.17.0    True        True          False      120m    Deployment "/openshift-network-console/networking-console-plugin" is waiting for other operators to become ready
node-tuning                                4.17.0    True        False         False      105m
openshift-apiserver                        4.17.0    True        False         False      111m
openshift-controller-manager               4.17.0    True        False         False      116m
openshift-samples                          4.17.0    True        False         False      108m
operator-lifecycle-manager                 4.17.0    True        False         False      117m
operator-lifecycle-manager-catalog         4.17.0    True        False         False      117m
operator-lifecycle-manager-packageserver   4.17.0    True        False         False      112m
service-ca                                 4.17.0    True        False         False      120m
storage                                    4.17.0    True        False         False      120m
% oc get nodes
NAME        STATUS   ROLES                  AGE    VERSION
ocp1-cp-1   Ready    control-plane,master   108m   v1.30.4
ocp1-cp-2   Ready    control-plane,master   126m   v1.30.4
ocp1-cp-3   Ready    control-plane,master   126m   v1.30.4
% oc get csr
No resources found

=> Worker nodes are missing
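
When workers are missing from `oc get nodes`, the usual first check is for unapproved node CSRs. Here `oc get csr` returned nothing at all, which means the workers never even requested to join, but the check is cheap. A sketch (the `filter_pending` helper is hypothetical, not part of this repo):

```shell
# filter_pending: read `oc get csr` table output on stdin and print the
# names of CSRs whose CONDITION column is "Pending".
filter_pending() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (assumes KUBECONFIG points at the guest cluster):
#   oc get csr | filter_pending | xargs -r oc adm certificate approve
```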

At the hosting cluster:

% oc get vm -n stormshift-ocp1-infra
NAME            AGE    STATUS    READY
ocp1-cp-0       142m   Running   True
ocp1-cp-1       141m   Running   True
ocp1-cp-2       141m   Running   True
ocp1-worker-0   140m   Running   True
ocp1-worker-1   139m   Running   True
ocp1-worker-2   139m   Running   True

=> Console output of ocp1-worker-1:

(Screenshot 2025-01-07 at 20:06:40)

After a reboot, it joined the cluster and the setup finished.

rbo commented Jan 7, 2025

Rebooted all three worker nodes by hand...
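
The manual workaround can also be scripted against the hosting cluster with the KubeVirt CLI. A dry-run sketch (VM names and namespace taken from the `oc get vm` output above; `restart_workers` is a hypothetical helper that only prints the commands):

```shell
# restart_workers: print (not execute) a `virtctl restart` command for each
# worker VM in the given namespace. Pipe the output to `sh` to actually
# trigger the restarts.
restart_workers() {
  local ns=$1; shift
  local vm
  for vm in "$@"; do
    echo virtctl restart "$vm" -n "$ns"
  done
}

restart_workers stormshift-ocp1-infra ocp1-worker-0 ocp1-worker-1 ocp1-worker-2
```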

rbo commented Jan 7, 2025

Installed...

rbo commented Feb 21, 2025

The last provisioning of ocp1 ran without any problems.

@rbo rbo closed this as completed Feb 21, 2025
@rbo rbo changed the title Installation of ocp1 never finishes Sometimes installation stuck, because no worker nodes added. Feb 21, 2025
rbo commented Feb 21, 2025

Had this with ocp7 and ocp9 provisioning as well.

@rbo rbo reopened this Feb 21, 2025
rbo commented Feb 21, 2025

Checked worker ocp7-worker-0:

...
Feb 21 16:53:03 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T16:53:03Z" level=info msg="Read disk 168.9 MiB/3.4 GiB (4%)\n"
...
Feb 21 16:55:15 ocp7-worker-0 installer[3112]: time="2025-02-21T16:55:15Z" level=info msg="Install complete.\n"
Feb 21 16:55:15 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T16:55:15Z" level=info msg="Install complete.\n"
Feb 21 16:55:15 ocp7-worker-0 installer[3112]: time="2025-02-21T16:55:15Z" level=info msg="Done writing image to disk"
Feb 21 16:55:15 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T16:55:15Z" level=info msg="Done writing image to disk"
Feb 21 16:55:15 ocp7-worker-0 installer[3112]: time="2025-02-21T16:55:15Z" level=info msg="Waiting for 2 ready masters"
Feb 21 16:55:15 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T16:55:15Z" level=info msg="Waiting for 2 ready masters"
Feb 21 16:55:15 ocp7-worker-0 installer[3112]: time="2025-02-21T16:55:15Z" level=info msg="Updating node installation stage: Waiting for control plane - " request_id=a103a376-04ec-4307-b748-fc9333871712
Feb 21 16:55:15 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T16:55:15Z" level=info msg="Updating node installation stage: Waiting for control plane - " request_id=a103a376-04ec-4307-b748-fc9333871712
...
Feb 21 17:26:57 ocp7-worker-0 logs_sender[3321]: time="21-02-2025 17:26:57" level=error msg="failed to send log progress collecting to service" file="send_logs.go:333" error="Put \"http://10.32.105.100:8090/api/assisted-install/v>
Feb 21 17:26:57 ocp7-worker-0 logs_sender[3321]: time="21-02-2025 17:26:57" level=info msg="Archiving /var/log/logs_host_faed97a2-fd6b-58bb-afe5-f14521c11bd2 and creating /var/log/logs.tar.gz" file="send_logs.go:336"
Feb 21 17:26:57 ocp7-worker-0 logs_sender[3321]: time="21-02-2025 17:26:57" level=info msg="Executing tar [-czvf /var/log/logs.tar.gz -C /var/log logs_host_faed97a2-fd6b-58bb-afe5-f14521c11bd2]" file="execute.go:39"
Feb 21 17:26:57 ocp7-worker-0 logs_sender[3321]: time="21-02-2025 17:26:57" level=error msg="Failed to upload file /var/log/logs.tar.gz to assisted-service" file="send_logs.go:348" error="Post \"http://10.32.105.100:8090/api/assi>
Feb 21 17:26:57 ocp7-worker-0 focused_boyd[3319]: Failed to run send logs  Post "http://10.32.105.100:8090/api/assisted-install/v2/clusters/c8e8cab5-86e1-4677-8d35-184297821fae/logs?host_id=faed97a2-fd6b-58bb-afe5-f14521c11bd2&in>
Feb 21 17:26:57 ocp7-worker-0 installer[3112]: time="2025-02-21T17:26:57Z" level=info msg="Failed to run send logs  Post \"http://10.32.105.100:8090/api/assisted-install/v2/clusters/c8e8cab5-86e1-4677-8d35-184297821fae/logs?host_>
Feb 21 17:26:57 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T17:26:57Z" level=info msg="Failed to run send logs  Post \"http://10.32.105.100:8090/api/assisted-install/v2/clusters/c8e8cab5-86e1-4677-8d35-184297821fae/l>
Feb 21 17:26:57 ocp7-worker-0 systemd[1]: libpod-030b31aebf34c87dbf9404edcff185300b988450b219413e77b4fa397743bc9c.scope: Deactivated successfully.
Feb 21 17:26:57 ocp7-worker-0 podman[3304]: 2025-02-21 17:26:57.847116497 +0000 UTC m=+1.841698380 container died 030b31aebf34c87dbf9404edcff185300b988450b219413e77b4fa397743bc9c (image=quay.io/openshift-release-dev/ocp-v4.0-art->
Feb 21 17:26:57 ocp7-worker-0 systemd[1]: var-lib-containers-storage-overlay\x2dcontainers-030b31aebf34c87dbf9404edcff185300b988450b219413e77b4fa397743bc9c-userdata-shm.mount: Deactivated successfully.
Feb 21 17:26:57 ocp7-worker-0 systemd[1]: var-lib-containers-storage-overlay-b13b05b38a5bd8d6492ebc9f788d36cee27b07112a21e395efd48dbbd846e5c8-merged.mount: Deactivated successfully.
Feb 21 17:26:57 ocp7-worker-0 podman[3345]: 2025-02-21 17:26:57.904760854 +0000 UTC m=+0.044538596 container remove 030b31aebf34c87dbf9404edcff185300b988450b219413e77b4fa397743bc9c (image=quay.io/openshift-release-dev/ocp-v4.0-ar>
Feb 21 17:26:57 ocp7-worker-0 systemd[1]: libpod-conmon-030b31aebf34c87dbf9404edcff185300b988450b219413e77b4fa397743bc9c.scope: Deactivated successfully.
Feb 21 17:26:57 ocp7-worker-0 installer[3112]: time="2025-02-21T17:26:57Z" level=info msg="failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host -v /run/s>
Feb 21 17:26:57 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T17:26:57Z" level=info msg="failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host >
Feb 21 17:26:57 ocp7-worker-0 installer[3112]: time="2025-02-21T17:26:57Z" level=error msg="upload installation logs failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net>
Feb 21 17:26:57 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T17:26:57Z" level=error msg="upload installation logs failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privile>
Feb 21 17:26:57 ocp7-worker-0 installer[3112]: time="2025-02-21T17:26:57Z" level=info msg="Updating node installation stage: Rebooting - " request_id=923283b6-a780-40c9-841d-33ae89f5f80e
Feb 21 17:26:57 ocp7-worker-0 assisted-installer[3110]: time="2025-02-21T17:26:57Z" level=info msg="Updating node installation stage: Rebooting - " request_id=923283b6-a780-40c9-841d-33ae89f5f80e

=> but no reboot happened!
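
The journal shows assisted-installer reporting the "Rebooting" stage while the guest stays up. A small helper for confirming what stage the installer last reported from a saved journal dump (`last_stage` is a hypothetical helper, and the unit/file name to feed it depends on the node):

```shell
# last_stage: read a journal dump on stdin and print the last installation
# stage that assisted-installer reported ("Updating node installation
# stage: ..." lines). If it says "Rebooting" but the node's uptime predates
# that log line, the guest never actually went down.
last_stage() {
  sed -n 's/.*Updating node installation stage: \([^"]*\)".*/\1/p' \
    | sed 's/ *- *$//' \
    | tail -n 1
}

# e.g.:  journalctl --no-pager | last_stage
```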

rbo commented Feb 21, 2025

@rbo rbo self-assigned this Feb 23, 2025