-
Notifications
You must be signed in to change notification settings - Fork 3.3k
v1.11 backports 2022-06-21 #20263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.11 backports 2022-06-21 #20263
Conversation
[ upstream commit 0e76fdb ] Recent test runs have taken perhaps up to five minutes just to compile various parts of the agent code from scratch with recent compilers, in addition to taking a further ten minutes or so to run the tests. This is close to the timeout, which means that occasionally the test will just fail because it hits the timeout. Bump it to mitigate the issue. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 9ede0e6 ] While there is an overall processing metric, it would be good to extract the datapath update time from other sections of the FQDN processing. This makes it easier to diagnose the source of any backups. It should be noted that dataplaneTime will be capped to option.Config.FQDNProxyResponseMaxDelay. After that time has elapsed, the update is cancelled and the DNS packet forwarded to the endpoint. Signed-off-by: Rahul Joshi <rkjoshi@google.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 520bf0c ] Since etcd >= 3.5.0 the library started to remove the URL Scheme from the URL making the custom dialer failing parse a "service ID" from the URL. This commit introduces some logic that will allow it to parse a "service ID" from the URL that doesn't have the URL Scheme, as per etcd 3.5.0. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 4bcd7d2 ] When a service matcher LRP and the selected backend pods are deployed first, we previously didn't check if the LRP frontend information (aka clusterIP) is available. This led to agent panic. The frontend information is populated only when the LRP selected service event is received. This issue won't be hit when the selected service was deployed prior to the LRP or backend pod. Reported-by: Karsten Nielsen Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit dd40215 ] The tunnel port value is currently hardcoded in the datapath even though we have a flag for users to redefine it. This commit fixes that. Fixes: 0300772 ("datapath: Add a flag to set VXLAN and Geneve ports") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d2aaba4 ] In some clusters, the KVStore is using etcd-operator which depends on Cilium agent to be up and ready. However, the Cilium agent, when running with "ipam: cluster-pool", depends on the Cilium Operator to be up-and-ready. Cilium Operator will not allocate any addresses because it depends on the KVStore to be up and ready. To prevent this chicken-and-egg problem, we will start the KVStore client of Cilium Operator asynchronously which will then be able to allocate addresses to Cilium Nodes, unblock Cilium which will be able to start the KVStore. Fixes: c63b9f5 ("operator: merge duplicate k8s watchers") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 5a4d261 ] Using a single event queue to handle events that require the KVStore and nodeManager will depend on the KVStore client to be connected which will prevent the nodeManager to handle events as well. To not depend on the KVStore then we should have two separate event queue handlers so that the nodeManager can handle IP Addresses in the cluster-pool ipam mode. This will allow clusters that have etcd-operator, which depend on the Cilium agent to have an IP addresses allocated to the CiliumNode where it is running. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 845565e ] Signed-off-by: Bryan Stenson <bryan.stenson@okta.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit a91e00e ] When retrying we have to explicitly use the previously fetched nodeResource in case we want to skip fetching it. Otherwise we end up with a null pointer exception. Fixes: 91e68c2 ("nodediscovery: ensure instanceID is not empty") Signed-off-by: Odin Ugedal <ougedal@palantir.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d06cc72 ] This part of the contributing guide was pointing to the wrong location, the upgrade guide has moved to another location for some time now. Reported-by: Karsten Nielsen <karsten.nielsen@ingka.ikea.com> Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 4bf70f1 ] [ Backporterter's note: There was a conflict due to cilium@a445f69 from cilium#19073 not have been backported. I opted for modidyfing the patch to call into the global IPCAche ] Node restart may lead to ip addresses 'swap' for services. In such case using ipcache.Delete may lead to deleting newly assigned ip address of other service. Fixes: cilium#19958 Signed-off-by: Viktor Kuzmin <kvaster@gmail.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit ccf4743 ] Reflect that cilium#17669 has been fixed. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: minor conflict due to another section. Pulled in only the relevant change ] [ upstream commit 590388d ] Signed-off-by: Wojtek Czekalski <me@wczekalski.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: changed patch to use v1.11 tags instead of latest] [ upstream commit 00b2a2f ] These will be tagged with :<SHA>-unstripped and :latest-unstripped and are e.g. useful for profiling. Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 8a0885e ] IPCache entries each have an associated source, which determines if an entry can be overwritten with information originating in a different source. If the node manager detects that a change to a node was rejected by the IPCache, it will skip informing any node subscribers about the change. For the `KubeAPIServer` source however, special care has to be taken. While the IPCache entry itself may not be overwritten, auxiliary node properties such as the health endpoints may still have changed and must be propagated to all node subscribers (such as the health subsystem). This commit adds a special check for `KubeAPIServer` entries which ensures that even if the IPCache entry did not change, that we still inform all node subscribers about the change. This fixes a bug where health endpoint changes on the kubeapi-server node where not picked up by remote cilium agent instances and thus those instances tried to reach a stale endpoint. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My change LGTM, thanks for adjusting the tags!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My changes look good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mine look good!
/test-backport-1.11 Job 'Cilium-PR-K8s-GKE' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My commit looks good
Changed from me LGTM. Would be good to clarify with @jrfastab / @dylandreimerink on who will work on getting those backports in, as they are both very useful bugfixes. The runtime one just needs to remove the runtime image bump changes and then follow the docs page on updating the runtime images. Not sure about the rp_filter conflicts. |
/test-gke edit: was unsuccessful in triaging this failure (https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/8696/). Retrying the test. If it succeeds will create a flake issue. |
/test-1.21-4.9 There are no obvious culprits in this backport's patch list. Retrying.
output:
So we seem to be hitting the timeout in: Maybe 10s is too short? On the other hand, there does not seem to be another issue reporting this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for my changes. Thanks!
This flake should now be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI now looks all good, and I checked the backports for Joe's PRs which are good as well.
Thanks!
gandro
tklauser
odinuge
pchaigno
aanm
qmonnet
aditighag
julianwiedmann
joestringer
Successfully merging this pull request may close these issues.
Skipped:
rp_filter
overwrite config on agent init #20072 -- datapath: Create sysctlrp_filter
overwrite config on agent init (@dylandreimerink)Once this PR is merged, you can update the PR labels via: