8000 Change how we correlate SSL to connection information by grcevski · Pull Request #1875 · grafana/beyla · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Change how we correlate SSL to connection information #1875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 1, 2025

Conversation

grcevski
Copy link
Contributor
@grcevski grcevski commented Apr 30, 2025

Until now, the way we correlated SSL information to a connection was based on looking for traffic while SSL_read or SSL_write were running. This sort of worked, but we recently found that it can be unstable with relation to sort interrupts. We can take a look at the following Beyla kernel trace:

           nginx-6195    [003] d... 285251.759323: bpf_trace_printk: tcp_sendmsg for identified SSL connection, ignoring...
           nginx-6195    [003] d... 285251.759324: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
           nginx-6195    [003] d... 285251.759327: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaf47c63780 ===
           nginx-6195    [003] d... 285251.759328: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=3012
           nginx-6195    [003] d... 285251.759328: bpf_trace_printk: [conn] d_h = 0, d_l = 64011aacffff0000, d_port=34418

>>> good correlation of the SSL 0x561882481e20 to the correct connection info
           nginx-6195    [003] d... 285251.759330: bpf_trace_printk: Correlating SSL 561882481e20 to connection
           nginx-6195    [003] d... 285251.759331: bpf_trace_printk: [conn] s_h = 0, s_l = 64011aacffff0000, s_port=34418
           nginx-6195    [003] d... 285251.759332: bpf_trace_printk: [conn] d_h = 0, d_l = c2141aacffff0000, d_port=3012
           nginx-6195    [003] d... 285251.759338: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===

We can notice here a start of a connection being established, followed by a send message to an unrelated port,
but it's being handled by our thread. This makes no sense, until we look at the printk flags

...d.s1.
 
s - means that we are handling a soft IRQ
1 - means that we are one level down preempted

So essentially, our work was interrupted and the thread was hijacked to handle the interrupt, causing traffic "under"
SSL_write which is unrelated.

           nginx-6195    [003] d.s1 285251.759389: bpf_trace_printk: === tcp_rcv_established id=6195 ===
           nginx-6195    [003] d.s1 285251.759400: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759400: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759410: bpf_trace_printk: === tcp_rcv_established id=6195 ===
           nginx-6195    [003] d.s1 285251.759412: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759413: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759415: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaded168000 ===
           nginx-6195    [003] d.s1 285251.759417: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759418: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443

>>> correlating to the wrong information
           nginx-6195    [003] d.s1 285251.759420: bpf_trace_printk: Correlating SSL 561882481e20 to connection
           nginx-6195    [003] d.s1 285251.759421: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759422: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759424: bpf_trace_printk: marked buffer as inactive
           nginx-6195    [003] d.s1 285251.759425: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759426: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759427: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
           nginx-6195    [003] d... 285251.759433: bpf_trace_printk: === kretprobe tcp_sendmsg=6195 sent 836===
           nginx-6195    [003] d... 285251.759439: bpf_trace_printk: === uretprobe SSL_write id=6195 args ffffb0024ccd9f38 ===

<<< SSL_write is complete, but the SSL pointer now is correlated to the wrong connection information.

           nginx-6195    [003] d... 285251.759441: bpf_trace_printk: SSL_buf id=6195 ssl=561882481e20
           nginx-6195    [003] d... 285251.759442: bpf_trace_printk: SSL conn
           nginx-6195    [003] d... 285251.759442: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d... 285251.759443: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d... 285251.759445: bpf_trace_printk: buf=[HTTP/1.1 200 OK
Server:�RG��V], pid=6195, len=807

>>> we can't find the original request for the request start and we fail to close the request.
           nginx-6195    [003] d... 285251.759446: bpf_trace_printk: No info, pid =6195?
           nginx-6195    [003] d... 285251.759448: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d... 285251.759449: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443

So, while are handling the SSL operations, we can be interrupted to handle something else on the same thread which throws a wrench in our SSL to connection tracking.

To counter that, we now do something else:

  1. We pull the file descriptor from the SSL BIO structs.
  2. We look up the thread task file descriptor table and pull the file * for the given file descriptor in SSL.
  3. We check to see if the file has a socket attached, which should always be the case for socket operations.
  4. We pull the connection information from the socket and associate it to the SSL pointer.

Testing:

  • Integration tests must pass, they do a lot of SSL connection checks
  • I'll test manually with istio book shop, which has some interesting corner cases.

@grcevski grcevski requested a review from a team as a code owner April 30, 2025 16:20
@grcevski grcevski added the port-to-otel-ebpf-inst PRs that need to be ported to otel-ebpf-instrumentation label Apr 30, 2025
@@ -31,7 +31,7 @@ static __always_inline u64 extra_runtime_id() {

static __always_inline u64 parent_runtime_id(u64 runtime_id) {
u64 lookup_id = runtime_id;
bpf_dbg_printk("parent lookup id %llx", lookup_id);
//bpf_dbg_printk("parent lookup id %llx", lookup_id);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be printing inside loops to avoid large programs.

@codecov-commenter
Copy link
codecov-commenter commented Apr 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.41%. Comparing base (16a15f8) to head (5b8c42f).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1875      +/-   ##
==========================================
+ Coverage   73.96%   74.41%   +0.44%     
==========================================
  Files         177      177              
  Lines       19423    19432       +9     
==========================================
+ Hits        14367    14460      +93     
+ Misses       4319     4232      -87     
- Partials      737      740       +3     
Flag Coverage Δ
integration-test 57.66% <ø> (+<0.01%) ⬆️
k8s-integration-test 55.92% <ø> (+0.76%) ⬆️
oats-test 35.87% <ø> (+0.28%) ⬆️
unittests 46.00% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@grcevski
Copy link
Contributor Author

This actually didn't quite work. We can't read the SSL BIO like that for a few reasons:

  • Doesn't work with boringssl, they use a different struct.
  • Doesn't work with openssl 3.3+.

I'm working on a different approach.

Copy link
Contributor
@rafaelroquetto rafaelroquetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great effort!

@grcevski grcevski merged commit 327f939 into grafana:main May 1, 2025
14 checks passed
@grcevski grcevski deleted the ssl_fixes branch May 1, 2025 16:21
beyla-alloy-integration bot pushed a commit to grafana/opentelemetry-ebpf-instrumentation that referenced this pull request May 21, 2025
    Automated port of #1875 from grafana/beyla.
    Original PR: grafana/beyla#1875
beyla-alloy-integration bot pushed a commit to grafana/opentelemetry-ebpf-instrumentation that referenced this pull request May 21, 2025
    Automated port of #1875 from grafana/beyla.
    Original PR: grafana/beyla#1875
marctc pushed a commit to grafana/opentelemetry-ebpf-instrumentation that referenced this pull request May 26, 2025
    Automated port of #1875 from grafana/beyla.
    Original PR: grafana/beyla#1875
mariomac pushed a commit to open-telemetry/opentelemetry-ebpf-instrumentation that referenced this pull request May 27, 2025
* [port] #1875: Change how we correlate SSL to connection information

    Automated port of #1875 from grafana/beyla.
    Original PR: grafana/beyla#1875

* fix rejections

* format

* fix bpf code

* fix stuff

* delete files

---------

Co-authored-by: GitHub Actions Bot <actions@github.com>
@marctc marctc removed the port-to-otel-ebpf-inst PRs that need to be ported to otel-ebpf-instrumentation label May 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0