v1.11 backports 2022-06-21 by kkourt · Pull Request #20263 · cilium/cilium · GitHub

v1.11 backports 2022-06-21 #20263


Merged
merged 15 commits into cilium:v1.11 on Jun 30, 2022

Conversation

kkourt
Contributor
@kkourt kkourt commented Jun 21, 2022

Skipped:

Once this PR is merged, you can update the PR labels via:

$ for pr in 19487 19992 20106 19522 20115 20131 19876 20158 20184 19996 20195 20206 20208 20238 20210; do contrib/backporting/set-labels.py $pr done 1.11; done

joestringer and others added 15 commits June 21, 2022 12:54
[ upstream commit 0e76fdb ]

Recent test runs have taken perhaps up to five minutes just to compile
various parts of the agent code from scratch with recent compilers, in
addition to taking a further ten minutes or so to run the tests. This is
close to the timeout, which means that occasionally the test will just
fail because it hits the timeout. Bump it to mitigate the issue.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 9ede0e6 ]

While there is an overall processing metric, it would be good to extract
the datapath update time from other sections of the FQDN processing.
This makes it easier to diagnose the source of any backups.

It should be noted that dataplaneTime will be capped to
option.Config.FQDNProxyResponseMaxDelay. After that time has elapsed,
the update is cancelled and the DNS packet forwarded to the endpoint.

Signed-off-by: Rahul Joshi <rkjoshi@google.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 520bf0c ]

Since etcd >= 3.5.0, the library strips the URL scheme from the
URL, causing the custom dialer to fail to parse a "service ID" from
the URL. This commit introduces logic that allows parsing a
"service ID" from a URL without a scheme, as provided by
etcd 3.5.0.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
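A minimal sketch of the parsing change, assuming a URL of the form `etcd://<service-id>/<address>` (the `splitServiceURL` helper and the exact URL layout are illustrative, not Cilium's actual dialer code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitServiceURL is a hypothetical helper: etcd >= 3.5.0 hands the custom
// dialer a URL with the scheme already stripped, so the parser must accept
// both "etcd://<id>/<addr>" and the bare "<id>/<addr>" form.
func splitServiceURL(u string) (id, addr string) {
	u = strings.TrimPrefix(u, "etcd://") // tolerate a missing scheme
	if i := strings.IndexByte(u, '/'); i >= 0 {
		return u[:i], u[i+1:]
	}
	return "", u
}

func main() {
	// Both forms now yield the same service ID.
	for _, u := range []string{
		"etcd://cilium-etcd/10.0.0.1:2379",
		"cilium-etcd/10.0.0.1:2379",
	} {
		id, addr := splitServiceURL(u)
		fmt.Println(id, addr)
	}
}
```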
[ upstream commit 4bcd7d2 ]

When a service matcher LRP and the selected backend pods
are deployed first, we previously didn't check whether the LRP
frontend information (i.e., the clusterIP) was available. This led to an agent panic.
The frontend information is populated only when the event for the
LRP-selected service is received.
This issue isn't hit when the selected service was deployed
prior to the LRP or backend pods.

Reported-by: Karsten Nielsen
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit dd40215 ]

The tunnel port value is currently hardcoded in the datapath even though
we have a flag for users to redefine it. This commit fixes that.

Fixes: 0300772 ("datapath: Add a flag to set VXLAN and Geneve ports")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d2aaba4 ]

In some clusters, the KVStore is run by etcd-operator, which depends on
the Cilium agent being up and ready. However, the Cilium agent, when running
with "ipam: cluster-pool", depends on the Cilium Operator being
up and ready, and the Cilium Operator will not allocate any addresses because
it depends on the KVStore being up and ready. To break this
chicken-and-egg problem, we start the KVStore client of Cilium
Operator asynchronously; the operator can then allocate addresses to
CiliumNodes, unblocking Cilium, which in turn can start its KVStore client.

Fixes: c63b9f5 ("operator: merge duplicate k8s watchers")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 5a4d261 ]

Using a single event queue for events that require the KVStore as well as
for nodeManager events ties event handling to the KVStore client being
connected, which prevents the nodeManager from handling events. To remove
this dependency on the KVStore, we use two separate event queue handlers so
that the nodeManager can handle IP addresses in the cluster-pool IPAM mode.
This allows clusters running etcd-operator, which depends on the
Cilium agent, to have an IP address allocated for the CiliumNode where
it is running.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
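A minimal concurrency sketch of the split (the channel names are illustrative, not Cilium's actual queue types): with separate queues, a KVStore consumer that is still waiting to connect no longer stalls nodeManager events.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two independent buffered queues, as in the commit's split.
	kvstoreEvents := make(chan string, 8)
	nodeEvents := make(chan string, 8)

	// KVStore handler: blocked until the store connects (simulated delay).
	go func() {
		time.Sleep(50 * time.Millisecond) // pretend we wait for etcd
		for ev := range kvstoreEvents {
			fmt.Println("kvstore:", ev)
		}
	}()

	done := make(chan struct{})
	// nodeManager handler: makes progress immediately on its own queue.
	go func() {
		for ev := range nodeEvents {
			fmt.Println("node:", ev)
		}
		close(done)
	}()

	// This event is handled even though the KVStore is not connected yet.
	nodeEvents <- "allocate CIDR for node-1"
	close(nodeEvents)
	close(kvstoreEvents)
	<-done
}
```

With a single shared queue, the nodeManager event above would sit behind the blocked KVStore consumer; with two queues it is processed right away.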
[ upstream commit 845565e ]

Signed-off-by: Bryan Stenson <bryan.stenson@okta.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit a91e00e ]

When retrying, we have to explicitly use the previously fetched
nodeResource in case we want to skip fetching it again. Otherwise we end
up with a nil pointer dereference.

Fixes: 91e68c2 ("nodediscovery: ensure instanceID is not empty")
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d06cc72 ]

This part of the contributing guide was pointing to the wrong location;
the upgrade guide moved to a new location some time ago.

Reported-by: Karsten Nielsen <karsten.nielsen@ingka.ikea.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 4bf70f1 ]

[ Backporter's note: There was a conflict because
cilium@a445f69
from cilium#19073 has not been
backported. I opted to modify the patch to call into the global
IPCache ]

A node restart may cause IP addresses to 'swap' between services.
In that case, using ipcache.Delete may delete the newly assigned IP address of another service.

Fixes: cilium#19958

Signed-off-by: Viktor Kuzmin <kvaster@gmail.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit ccf4743 ]

Reflect that cilium#17669 has been fixed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: minor conflict due to another section. Pulled in
only the relevant change ]

[ upstream commit 590388d ]

Signed-off-by: Wojtek Czekalski <me@wczekalski.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: changed patch to use v1.11 tags instead of latest ]

[ upstream commit 00b2a2f ]

These will be tagged with :<SHA>-unstripped and :latest-unstripped and
are e.g. useful for profiling.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 8a0885e ]

IPCache entries each have an associated source, which determines if an
entry can be overwritten with information originating in a different
source. If the node manager detects that a change to a node was rejected
by the IPCache, it will skip informing any node subscribers about the
change.

For the `KubeAPIServer` source however, special care has to be taken.
While the IPCache entry itself may not be overwritten, auxiliary node
properties such as the health endpoints may still have changed and must
be propagated to all node subscribers (such as the health subsystem).

This commit adds a special check for `KubeAPIServer` entries which
ensures that, even if the IPCache entry did not change, we still
inform all node subscribers about the change. This fixes a bug where
health endpoint changes on the kube-apiserver node were not picked up
by remote Cilium agent instances, which therefore tried to reach
a stale endpoint.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
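The special case can be sketched as follows; `shouldNotifySubscribers` and the source names are illustrative of the rule described above, not Cilium's actual API:

```go
package main

import "fmt"

// source models IPCache entry provenance; entries from higher-precedence
// sources may reject overwrites coming from lower-precedence ones.
type source string

const (
	sourceKubernetes    source = "kubernetes"
	sourceKubeAPIServer source = "kube-apiserver"
)

// shouldNotifySubscribers sketches the fix: a node update rejected by the
// IPCache is normally not propagated, but KubeAPIServer-sourced entries
// still notify subscribers so auxiliary fields (e.g. health endpoint IPs)
// reach the health subsystem and other listeners.
func shouldNotifySubscribers(entrySource source, ipcacheAccepted bool) bool {
	if ipcacheAccepted {
		return true
	}
	return entrySource == sourceKubeAPIServer
}

func main() {
	fmt.Println(shouldNotifySubscribers(sourceKubeAPIServer, false)) // true: special case
	fmt.Println(shouldNotifySubscribers(sourceKubernetes, false))    // false: rejected and dropped
}
```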
@kkourt kkourt requested review from a team as code owners June 21, 2022 12:05
@kkourt kkourt requested a review from pchaigno June 21, 2022 12:05
@kkourt kkourt added backport/1.11 kind/backports This PR provides functionality previously merged into master. labels Jun 21, 2022
Member
@tklauser tklauser left a comment

My change LGTM, thanks for adjusting the tags!

Member
@pchaigno pchaigno left a comment


My changes look good!

Member
@odinuge odinuge left a comment


Mine look good!

@kkourt
Contributor Author
kkourt commented Jun 21, 2022

/test-backport-1.11

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:


Test Name

K8sHubbleTest Hubble Observe Test L7 Flow

Failure Output

FAIL: hubble observe query timed out on "Exitcode: 0 \nStdout:\n \t \nStderr:\n \t \n"

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Member
@gandro gandro left a comment


My commit looks good

@joestringer
Member

Changes from me LGTM. It would be good to clarify with @jrfastab / @dylandreimerink who will work on getting those backports in, as they are both very useful bugfixes. The runtime one just needs the runtime image bump changes removed, then follow the docs page on updating the runtime images. Not sure about the rp_filter conflicts.

@kkourt
Contributor Author
kkourt commented Jun 22, 2022

/test-gke

edit: I was unsuccessful in triaging this failure (https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/8696/). Retrying the test; if it succeeds, I will create a flake issue.

@kkourt
Contributor Author
kkourt commented Jun 22, 2022

/test-1.21-4.9

There are no obvious culprits in this backport's patch list. Retrying.

output:
FAIL: hubble observe query timed out on "Exitcode: 0 \nStdout:\n \t \nStderr:\n \t \n"
Expected
    <*errors.errorString | 0xc0005a17a0>: {
        s: "Expected string \"L7\" is not in the filter output of \"{$.Type}\": 10s timeout expired",
    }
to be nil
14:27:30 STEP: Removing visibility annotation on pod with labels id=app1,zgroup=testapp
time="2022-06-21T14:27:30Z" level=debug msg="running command: kubectl annotate pod -n 202206211427k8shubbletesthubbleobservetestl3l4flow -l id=app1,zgroup=testapp io.cilium.proxy-visibility-"
time="2022-06-21T14:27:30Z" level=error msg="Error executing command '/bin/bash bash -c kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http'" error="signal: killed"
time="2022-06-21T14:27:30Z" level=error msg="Error executing command '/bin/bash bash -c kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http'" error="signal: killed"
cmd: "kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http" exitCode: -1 duration: 10.113555138s stdout:

So we seem to be hitting the timeout in:
https://github.com/kkourt/cilium/blob/be13923a72a41bb04d59e60122384fab3c0d038a/test/k8sT/hubble.go#L239

https://github.com/kkourt/cilium/blob/be13923a72a41bb04d59e60122384fab3c0d038a/test/helpers/cons.go#L23

Maybe 10s is too short? On the other hand, there does not seem to be another issue reporting this.

@pchaigno
Member

@kkourt That flake is #17307. The identifying factors are GKE + connection refused on readiness probes.

Member
@aanm aanm left a comment


Thank you!

@dylandreimerink
Member
dylandreimerink commented Jun 23, 2022

Backport for #20072 with resolved conflicts has been made in #20242, we can merge that PR or cherry-pick the commits from that into this PR.

Member
@aditighag aditighag left a comment


Looks good for my changes. Thanks!

@pchaigno
Member
pchaigno commented Jun 28, 2022

This flake should now be fixed.
/ci-awscni-1.11

Member
@qmonnet qmonnet left a comment


CI now looks all good, and I checked the backports for Joe's PRs which are good as well.

Thanks!

@qmonnet qmonnet merged commit dcd96ee into cilium:v1.11 Jun 30, 2022