Description
Is this the right place to submit this?
- This is not a security vulnerability or a crashing bug
- This is not a question about how to use Istio
Bug Description
On Friday, we made changes to our EKS cluster, which included removing and reinstalling the following components:
amazon-vpc-cni
coredns
ebs-csi-driver
The updates appeared successful at the time. However, over the weekend, we observed widespread instability across the cluster, with most workloads repeatedly restarting or becoming unavailable.
After log analysis, we could not find anything that jumped out at us, however, we discovered that removing and reapplying the istio.io/dataplane-mode=ambient label on affected namespaces immediately restored functionality.
To aid in further investigation, we have intentionally left one namespace in its broken state.
Request for Guidance:
We are seeking advice on what to investigate next. Specifically:
What Istio components or configurations could be affected by the removal and reinstallation of the VPC CNI, CoreDNS, or EBS CSI driver?
Are there known issues with Istio Ambient Mesh and namespace labeling that could cause this behavior?
What logs, metrics, or Istio control plane components should we focus on to better understand the root cause?
Any insights or suggestions would be greatly appreciated.
Version
Istio
client version: 1.24.1
control plane version: 1.25.0
data plane version: 1.25.0 (28 proxies)
Kubectl
Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.31.7-eks-4096722
Helm
v3.17.3+ge4da497
Additional Information
No response