-
Notifications
You must be signed in to change notification settings - Fork 3.2k
v1.13 backport 2023-01-05 #22948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.13 backport 2023-01-05 #22948
Conversation
[ upstream commit 6d8f4a8 ] This work completes the standalone egress GW functionality by implementing generic nat64 without having to configure specific services for remap: On the Gateway node: # ./daemon/cilium-agent --enable-ipv4=true --enable-ipv6=true \ --datapath-mode=lb-only --bpf-lb-algorithm=maglev \ --bpf-lb-maglev-table-size=2039 --bpf-lb-acceleration=disabled \ --devices=enp5s0 --disable-envoy-version-check=true \ --enable-stateless-nat46x64=true # ./cilium/cilium service list ID Frontend Service Type Backend # # tcpdump -i enp5s0 port 80 -n [...] 12:06:05.823733 IP6 2a02:168:f656:0:1ac0:4dff:ff01:d5e6.54054 > 64:ff9b::8c52:7903.80: Flags [S], seq 2743947568, win 43200, options [mss 1440,sackOK,TS val 3724070640 ecr 0,nop,wscale 9], length 0 12:06:05.823745 IP 192.168.2.11.54054 > 140.82.121.3.80: Flags [S], seq 2743947568, win 43200, options [mss 1440,sackOK,TS val 3724070640 ecr 0,nop,wscale 9], length 0 12:06:05.830454 IP 140.82.121.3.80 > 192.168.2.11.54054: Flags [S.], seq 433570079, ack 2743947569, win 65535, options [mss 1436,sackOK,TS val 1565141015 ecr 3724070640,nop,wscale 10], length 0 12:06:05.830479 IP6 64:ff9b::8c52:7903.80 > 2a02:168:f656:0:1ac0:4dff:ff01:d5e6.54054: Flags [S.], seq 433570079, ack 2743947569, win 65535, options [mss 1436,sackOK,TS val 1565141015 ecr 3724070640,nop,wscale 10], length 0 [...] The client only needs to add a next hop entry in its routing table, no special source address selection is needed: # ip -6 r [...] 64:ff9b::/96 via 2a02:168:f656:0:1ac0:4dff:ff01:c164 dev enp5s0 metric 1024 pref medium [...] # curl --verbose "http://[64:ff9b::101:101]" * Trying 64:ff9b::101:101:80... * TCP_NODELAY set * Connected to 64:ff9b::101:101 (64:ff9b::101:101) port 80 (#0) > GET / HTTP/1.1 > Host: [64:ff9b::101:101] > User-Agent: curl/7.68.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 403 Forbidden < Date: Tue, 29 Nov 2022 13:30:30 GMT < Content-Type: text/plain; charset=UTF-8 < Content-Length: 16 < Connection: close < X-Frame-Options: SAMEORIGIN < Referrer-Policy: same-origin < Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0 < Expires: Thu, 01 Jan 1970 00:00:01 GMT < Server: cloudflare < CF-RAY: 771bb2c739d5cc3e-ZRH < * Closing connection 0 [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 225b31c ] [ Fixed merge conflict in test suite to use updated helm config. ] Rename the agent setting --enable-stateless-nat46x64 into a more generic --enable-nat46x64-gateway, given the plan is to enable both stateless and statefull translation in this case as both can be supported in parallel. This has not seen a stable release yet, so rename is fine. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…atus [ upstream commit 64d7518 ] For cilium status dump we want to reflect the state of services vs gateway support on NAT46/64. Therefore extend both into objects which can be extended further if needed. Also for the GW we expose the current fixed default prefix. In future, this may be a list of prefixes. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…status [ upstream commit 80d9cbc ] Auto-generated API code in here. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…tatus [ upstream commit 94003f2 ] Add status dump and separate both Service/Gateway. Also dump the prefix for the gateway. It's hard coded here, but subsequent commit will move this into defaults package. In future this can be made configurable and potentially we might support multiple prefixes. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 4631570 ] Refactor the code such that we can easily expose it as a config knob and have the underlying code be generic. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 148f71a ] This cannot be true if err == nil and can also be removed in validateIPv6ClusterAllocCIDR. Also, s/CIDR length must be/prefix length must be/. Suggested-by: Timo Beckers <timo@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d8c7d93 ] For new code, preference is to use net/netip package. Suggested-by: Timo Beckers <timo@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 3f616d4 ] Move neighbor handling code into helper functions as we're going to reuse them also in other locations. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit f8ca297 ] Given original requests might originate from the same L2 network, we need to record them in the neighbor table as otherwise the gateway node may not be aware of them since they are not in the kernel's neighbor table. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 1582b48 ] Remove any ambiguity around CB_NAT_46X64 when its read from the tail call CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS by setting it for the two locations where its not yet done. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 3f125d8 ] For the case when IPv4 replies comes from the remote destination and we need to retransform the packet into an IPv6 one via reverse SNAT, what is needed is fib lookup for the target ifindex as well as figuring out the next hop. While the code hits rev_nodeport_lb6(), we don't have a CT entry there given these NAT gateway entries are not service-related. What happened was that it hit upper stack for recirculation given we did not enter the branch where we do fib lookup and push it back out. This was also why for XDP some traffic could be observed via tcpdump which is obviously a bug. The two snat_remap* cases in IPv4 code path differ in that the first case the IPv4 reply from remote comes back in, and given a service must have been configured, it'll find a rev_nodeport_lb6() CT lookup, but for the generic stateful NAT64 GW gateway this is not the case. However, we still need to enter the fib_lookup(): While there is no reverse DNAT in this case, we need to find the ifindex, and lookup the next hop with fallback to our BPF neighbor table if the original client was not seen in the regular neighbor table of the kernel. This fixes the bug, and also addresses another one at the same time, that is, if the original request came from a client in the same L2, we can now figure out the recorded neighbor address via BPF map as fallback if it was not in the neighbor cache. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 433bddc ] This is just duplicated code from when the Ingress and Egress NAT paths were split up. In today's code, nothing ever sets the NAT46x64_MODE_XLATE flag before calling to CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS. Fixes: 536ad0c ("bpf/nodeport: split the cilium call ipv6 nodeport nat") Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 57a59c4 ] Use the packet's actual IP header for the FIB lookup (instead of the CT tuple). They should be identical, but this (1) clarifies that we really use the content of the translated packet, and (2) allows us to run through the FIB code without building the CT tuple. To avoid a revalidation after the maybe_add_l2_hdr() call, we re-use the information already stored in the FIB param struct. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 443b75c ] The NAT64 reply path piggy-backs on the FIB code in IPv6 RevDNAT. But there's no more need to extract the CT tuple, the FIB code just uses the packet's IPv6 header now. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
/test-backport-1.13 Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bpf parts look good, thanks!
(Currently investigating the l4lb issue, can repro locally on tc BPF in master too. XDP works. Other than that, all ci green except ci-l4lb.) |
fc99e32
to
cb33581
Compare
[ upstream commit f9af2df ] CB_SRC_IDENTITY is defined as 0 in nodeport header. This is actually the same as CB_SRC_LABEL defined in common.h. Clean this up to make it more obvious. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit ad11484 ] Remap CB_NAT_46X64 to CB_IFINDEX as it does collide in some cases with CB_SRC_LABEL inside nodeport_lb6(): [...] ctx_store_meta(ctx, CB_NAT_46X64, 0); ctx_store_meta(ctx, CB_SRC_LABEL, src_identity); ep_tail_call(ctx, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS); return DROP_MISSED_TAIL_CALL; The issue has been observed in case of IPv4 frontend and IPv6 backend address. When the IPv6 reply from remote comes back to the gateway, then src_identity is 2 (world), and therefore CB_NAT_46X64 becomes true later on in the tail call even though it's not supposed to be. This has been observed in tc BPF while XDP BPF kept functioning given in tc BPF we extract identities in bpf_host: [...] case bpf_htons(ETH_P_IPV6): identity = resolve_srcid_ipv6(ctx, identity, from_host); ctx_store_meta(ctx, CB_SRC_LABEL, identity); [...] case bpf_htons(ETH_P_IP): identity = resolve_srcid_ipv4(ctx, identity, &ipcache_srcid, from_host); ctx_store_meta(ctx, CB_SRC_LABEL, identity); [...] The CB_IFINDEX does not collide with NAT46x64 given in case of LB it's only used in case of DSR to transfer IPv4/IPv6 service address info via tail call, whereas NAT46x64 gateway only operates in SNAT mode. Reproducer: # ./daemon/cilium-agent --enable-ipv4=true --enable-ipv6=true \ --datapath-mode=lb-only --bpf-lb-algorithm=maglev --bpf-lb-maglev-table-size=2039 \ --bpf-lb-acceleration=disabled --devices=enp5s0 --disable-envoy-version-check=true \ --enable-nat46x64-gateway=true The --enable-nat46x64-gateway must be enabled for this to trigger. # ./cilium/cilium service list ID Frontend Service Type Backend 2 2.2.2.2:80 ExternalIPs 1 => [2a03:2880:f11c:8183:face:b00c:0:25de]:80 (active) Before fix: # curl -4 --verbose http://2.2.2.2:80 * Trying 2.2.2.2:80... * TCP_NODELAY set < fail > After fix: # curl -4 --verbose http://2.2.2.2:80 * Trying 2.2.2.2:80... * TCP_NODELAY set * Connected to 2.2.2.2 (2.2.2.2) port 80 (#0) > GET / HTTP/1.1 > Host: 2.2.2.2 > User-Agent: curl/7.68.0 > Accept: */* > * Mark bundle as not supporting multiuse < HTTP/1.1 301 Moved Permanently < Location: https://2.2.2.2/ < Content-Type: text/plain < Server: proxygen-bolt [...] Reported-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
cb33581
to
e0b1dbe
Compare
/ci-l4lb-1.13 |
06f4ad8
to
1b34917
Compare
/test-backport-1.13 Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
/test-upstream-k8s |
gke:
|
/test-runtime |
5e9db42
to
e0b1dbe
Compare
(L4LB suite requires a full backport/sync with master. Related GH, discussing with Martynas about needed PRs, then I'll give it a go. All other tests were green or known issues (above mentioned).) |
Manual 1.13 backport of:
Once this PR is merged, you can update the PR labels via: