v1.11 backports 2022-06-21 by kkourt · Pull Request #20263 · cilium/cilium · GitHub

v1.11 backports 2022-06-21 #20263


Merged
merged 15 commits into cilium:v1.11 on Jun 30, 2022

Conversation

kkourt
Contributor
@kkourt kkourt commented Jun 21, 2022

Skipped:

Once this PR is merged, you can update the PR labels via:

$ for pr in 19487 19992 20106 19522 20115 20131 19876 20158 20184 19996 20195 20206 20208 20238 20210; do contrib/backporting/set-labels.py $pr done 1.11; done

joestringer and others added 15 commits June 21, 2022 12:54
[ upstream commit 0e76fdb ]

Recent test runs have taken perhaps up to five minutes just to compile
various parts of the agent code from scratch with recent compilers, in
addition to taking a further ten minutes or so to run the tests. This is
close to the timeout, which means that occasionally the test will just
fail because it hits the timeout. Bump it to mitigate the issue.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 9ede0e6 ]

While there is an overall processing metric, it would be good to extract
the datapath update time from other sections of the FQDN processing.
This makes it easier to diagnose the source of any backups.

It should be noted that dataplaneTime will be capped to
option.Config.FQDNProxyResponseMaxDelay. After that time has elapsed,
the update is cancelled and the DNS packet forwarded to the endpoint.

Signed-off-by: Rahul Joshi <rkjoshi@google.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 520bf0c ]

Since etcd >= 3.5.0, the library strips the URL scheme from the
URL, causing the custom dialer to fail to parse a "service ID" from
the URL. This commit introduces logic that allows parsing a
"service ID" from a URL without a scheme, as provided by
etcd 3.5.0.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
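A minimal sketch of the parsing change, assuming a URL of the form `etcd://<service-id>/<address>` (the `splitServiceURL` helper and the exact URL layout are illustrative, not Cilium's actual dialer code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitServiceURL is a hypothetical helper: etcd >= 3.5.0 hands the custom
// dialer a URL with the scheme already stripped, so the parser must accept
// both "etcd://<id>/<addr>" and the bare "<id>/<addr>" form.
func splitServiceURL(u string) (id, addr string) {
	u = strings.TrimPrefix(u, "etcd://") // tolerate a missing scheme
	if i := strings.IndexByte(u, '/'); i >= 0 {
		return u[:i], u[i+1:]
	}
	return "", u
}

func main() {
	// Both forms now yield the same service ID.
	for _, u := range []string{
		"etcd://cilium-etcd/10.0.0.1:2379",
		"cilium-etcd/10.0.0.1:2379",
	} {
		id, addr := splitServiceURL(u)
		fmt.Println(id, addr)
	}
}
```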
[ upstream commit 4bcd7d2 ]

When a service matcher LRP and the selected backend pods
are deployed first, we previously didn't check whether the LRP
frontend information (i.e., the clusterIP) was available. This led to an agent panic.
The frontend information is populated only when the event for the
LRP-selected service is received.
This issue isn't hit when the selected service was deployed
prior to the LRP or backend pods.

Reported-by: Karsten Nielsen
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit dd40215 ]

The tunnel port value is currently hardcoded in the datapath even though
we have a flag for users to redefine it. This commit fixes that.

Fixes: 0300772 ("datapath: Add a flag to set VXLAN and Geneve ports")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d2aaba4 ]

In some clusters, the KVStore is run by etcd-operator, which depends on
the Cilium agent being up and ready. However, the Cilium agent, when running
with "ipam: cluster-pool", depends on the Cilium Operator being
up and ready, and the Cilium Operator will not allocate any addresses because
it depends on the KVStore being up and ready. To break this
chicken-and-egg problem, we start the KVStore client of Cilium
Operator asynchronously; the operator can then allocate addresses to
CiliumNodes, unblocking Cilium, which in turn can start its KVStore client.

Fixes: c63b9f5 ("operator: merge duplicate k8s watchers")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 5a4d261 ]

Using a single event queue for events that require the KVStore as well as
for nodeManager events ties event handling to the KVStore client being
connected, which prevents the nodeManager from handling events. To remove
this dependency on the KVStore, we use two separate event queue handlers so
that the nodeManager can handle IP addresses in the cluster-pool IPAM mode.
This allows clusters running etcd-operator, which depends on the
Cilium agent, to have an IP address allocated for the CiliumNode where
it is running.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
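A minimal concurrency sketch of the split (the channel names are illustrative, not Cilium's actual queue types): with separate queues, a KVStore consumer that is still waiting to connect no longer stalls nodeManager events.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two independent buffered queues, as in the commit's split.
	kvstoreEvents := make(chan string, 8)
	nodeEvents := make(chan string, 8)

	// KVStore handler: blocked until the store connects (simulated delay).
	go func() {
		time.Sleep(50 * time.Millisecond) // pretend we wait for etcd
		for ev := range kvstoreEvents {
			fmt.Println("kvstore:", ev)
		}
	}()

	done := make(chan struct{})
	// nodeManager handler: makes progress immediately on its own queue.
	go func() {
		for ev := range nodeEvents {
			fmt.Println("node:", ev)
		}
		close(done)
	}()

	// This event is handled even though the KVStore is not connected yet.
	nodeEvents <- "allocate CIDR for node-1"
	close(nodeEvents)
	close(kvstoreEvents)
	<-done
}
```

With a single shared queue, the nodeManager event above would sit behind the blocked KVStore consumer; with two queues it is processed right away.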
[ upstream commit 845565e ]

Signed-off-by: Bryan Stenson <bryan.stenson@okta.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit a91e00e ]

When retrying, we have to explicitly use the previously fetched
nodeResource in case we want to skip fetching it again. Otherwise we end
up with a nil pointer dereference.

Fixes: 91e68c2 ("nodediscovery: ensure instanceID is not empty")
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit d06cc72 ]

This part of the contributing guide was pointing to the wrong location;
the upgrade guide moved to a new location some time ago.

Reported-by: Karsten Nielsen <karsten.nielsen@ingka.ikea.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 4bf70f1 ]

[ Backporter's note: There was a conflict because
cilium@a445f69
from cilium#19073 has not been
backported. I opted to modify the patch to call into the global
IPCache ]

A node restart may cause IP addresses to 'swap' between services.
In that case, using ipcache.Delete may delete the newly assigned IP address of another service.

Fixes: cilium#19958

Signed-off-by: Viktor Kuzmin <kvaster@gmail.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit ccf4743 ]

Reflect that cilium#17669 has been fixed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: minor conflict due to another section. Pulled in
only the relevant change ]

[ upstream commit 590388d ]

Signed-off-by: Wojtek Czekalski <me@wczekalski.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ Backporter's note: changed patch to use v1.11 tags instead of latest ]

[ upstream commit 00b2a2f ]

These will be tagged with :<SHA>-unstripped and :latest-unstripped and
are e.g. useful for profiling.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
[ upstream commit 8a0885e ]

IPCache entries each have an associated source, which determines if an
entry can be overwritten with information originating in a different
source. If the node manager detects that a change to a node was rejected
by the IPCache, it will skip informing any node subscribers about the
change.

For the `KubeAPIServer` source however, special care has to be taken.
While the IPCache entry itself may not be overwritten, auxiliary node
properties such as the health endpoints may still have changed and must
be propagated to all node subscribers (such as the health subsystem).

This commit adds a special check for `KubeAPIServer` entries which
ensures that, even if the IPCache entry did not change, we still
inform all node subscribers about the change. This fixes a bug where
health endpoint changes on the kube-apiserver node were not picked up
by remote Cilium agent instances, which therefore tried to reach
a stale endpoint.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
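The special case can be sketched as follows; `shouldNotifySubscribers` and the source names are illustrative of the rule described above, not Cilium's actual API:

```go
package main

import "fmt"

// source models IPCache entry provenance; entries from higher-precedence
// sources may reject overwrites coming from lower-precedence ones.
type source string

const (
	sourceKubernetes    source = "kubernetes"
	sourceKubeAPIServer source = "kube-apiserver"
)

// shouldNotifySubscribers sketches the fix: a node update rejected by the
// IPCache is normally not propagated, but KubeAPIServer-sourced entries
// still notify subscribers so auxiliary fields (e.g. health endpoint IPs)
// reach the health subsystem and other listeners.
func shouldNotifySubscribers(entrySource source, ipcacheAccepted bool) bool {
	if ipcacheAccepted {
		return true
	}
	return entrySource == sourceKubeAPIServer
}

func main() {
	fmt.Println(shouldNotifySubscribers(sourceKubeAPIServer, false)) // true: special case
	fmt.Println(shouldNotifySubscribers(sourceKubernetes, false))    // false: rejected and dropped
}
```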
@kkourt kkourt requested review from a team as code owners June 21, 2022 12:05
@kkourt kkourt requested a review from pchaigno June 21, 2022 12:05
@kkourt kkourt added backport/1.11 kind/backports This PR provides functionality previously merged into master. labels Jun 21, 2022
Member
@tklauser tklauser left a comment

My change LGTM, thanks for adjusting the tags!

Member
@pchaigno pchaigno left a comment


My changes look good!

Member
@odinuge odinuge left a comment


Mine look good!

@kkourt
Contributor Author
kkourt commented Jun 21, 2022

/test-backport-1.11

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:


Test Name

K8sHubbleTest Hubble Observe Test L7 Flow

Failure Output

FAIL: hubble observe query timed out on "Exitcode: 0 \nStdout:\n \t \nStderr:\n \t \n"

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Member
@gandro gandro left a comment


My commit looks good

@joestringer
Member

Changes from me LGTM. It would be good to clarify with @jrfastab / @dylandreimerink who will work on getting those backports in, as they are both very useful bugfixes. The runtime one just needs the runtime image bump changes removed, then follow the docs page on updating the runtime images. Not sure about the rp_filter conflicts.

@kkourt
Contributor Author
kkourt commented Jun 22, 2022

/test-gke

edit: I was unsuccessful in triaging this failure (https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/8696/). Retrying the test; if it succeeds, I will create a flake issue.

@kkourt
Contributor Author
kkourt commented Jun 22, 2022

/test-1.21-4.9

There are no obvious culprits in this backport's patch list. Retrying.

output:
FAIL: hubble observe query timed out on "Exitcode: 0 \nStdout:\n \t \nStderr:\n \t \n"
Expected
    <*errors.errorString | 0xc0005a17a0>: {
        s: "Expected string \"L7\" is not in the filter output of \"{$.Type}\": 10s timeout expired",
    }
to be nil
14:27:30 STEP: Removing visibility annotation on pod with labels id=app1,zgroup=testapp
time="2022-06-21T14:27:30Z" level=debug msg="running command: kubectl annotate pod -n 202206211427k8shubbletesthubbleobservetestl3l4flow -l id=app1,zgroup=testapp io.cilium.proxy-visibility-"
time="2022-06-21T14:27:30Z" level=error msg="Error executing command '/bin/bash bash -c kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http'" error="signal: killed"
time="2022-06-21T14:27:30Z" level=error msg="Error executing command '/bin/bash bash -c kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http'" error="signal: killed"
cmd: "kubectl exec -n kube-system cilium-jdlt2 -c cilium-agent -- hubble observe --follow --output=json --last 1 --type l7 --from-pod 202206211427k8shubbletesthubbleobservetestl3l4flow/app2-58757b7dd5-57d8n --to-namespace 202206211427k8shubbletesthubbleobservetestl3l4flow --to-label id=app1,zgroup=testapp --protocol http" exitCode: -1 duration: 10.113555138s stdout:

So we seem to be hitting the timeout in:
https://github.com/kkourt/cilium/blob/be13923a72a41bb04d59e60122384fab3c0d038a/test/k8sT/hubble.go#L239

https://github.com/kkourt/cilium/blob/be13923a72a41bb04d59e60122384fab3c0d038a/test/helpers/cons.go#L23

Maybe 10s is too short? On the other hand, there does not seem to be another issue reporting this.

@pchaigno
Member

@kkourt That flake is #17307. The identifying factors are GKE + connection refused on readiness probes.

Member
@aanm aanm left a comment


Thank you!

@dylandreimerink
Member
dylandreimerink commented Jun 23, 2022

Backport for #20072 with resolved conflicts has been made in #20242, we can merge that PR or cherry-pick the commits from that into this PR.

Member
@aditighag aditighag left a comment


Looks good for my changes. Thanks!

@pchaigno
Member
pchaigno commented Jun 28, 2022

This flake should now be fixed.
/ci-awscni-1.11

Member
@qmonnet qmonnet left a comment


CI now looks all good, and I checked the backports for Joe's PRs which are good as well.

Thanks!

@qmonnet qmonnet merged commit dcd96ee into cilium:v1.11 Jun 30, 2022