8000 [FEAT] log ebpf errors by geyslan · Pull Request #2352 · aquasecurity/tracee · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[FEAT] log ebpf errors #2352

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 15, 2022
Merged

Conversation

geyslan
Copy link
Member
@geyslan geyslan commented Nov 10, 2022

Description (git log)

commit cfd8611

pkg/ebpf: make string copy work in focal 4.19

commit ea4cac9

logger: warn event loss

Due ChanErrors removal #2403, warn about it.

commit ff4c077

logger: ebpf errors

Introduce EBPF error control mechanism - submission via perf buffer.
EBPF errors are aggregated before being submitted and are later
processed and logged by the processBPFErrors() go routine.

This starts the use of bpf error mechanism through some type checking
like BPF_ERR_MAP_UPDATE_ELEM, BPF_ERR_GET_CURRENT_COMM and
BPF_ERR_INIT_CONTEXT.

Fixes: #2174

Type of change

  • New feature (non-breaking change adding functionality).

How Has This Been Tested?

Toggled a checking (falsed an error) in the ebpf side to be able to send the error and get it in user space.

❯ TRACEE_LOGGER_LVL=debug sudo -E ./dist/tracee-ebpf -t comm=who
[sudo] password for gg: 
{"level":"debug","ts":1670529435.1558275,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"ID","OS_KERNEL_RELEASE":"manjaro"}
{"level":"debug","ts":1670529435.1558466,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"ID_LIKE","OS_KERNEL_RELEASE":"arch"}
{"level":"debug","ts":1670529435.155852,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"BUILD_ID","OS_KERNEL_RELEASE":"rolling"}
{"level":"debug","ts":1670529435.1558602,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"KERNEL_RELEASE","OS_KERNEL_RELEASE":"5.15.81-1-MANJARO"}
{"level":"debug","ts":1670529435.1558657,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"ARCH","OS_KERNEL_RELEASE":"x86_64"}
{"level":"debug","ts":1670529435.1558702,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":46,"OSReleaseField":"PRETTY_NAME","OS_KERNEL_RELEASE":"\"Manjaro Linux\""}
{"level":"debug","ts":1670529435.1560128,"msg":"osinfo","pkg":"urfave","file":"urfave.go","line":117,"security_lockdown":"none"}
TIME             UID    COMM             PID     TID     RET              EVENT                ARGS


{"level":"debug","ts":1670529436.1772962,"msg":"Failed to add or update a map element","id":2,"type":"BPF_ERR_MAP_UPDATE_ELEM","ret":0,"cpu":12,"file":"./pkg/ebpf/c/tracee.bpf.c","line":3514,"count":1}
{"level":"debug","ts":1670529436.1775048,"msg":"Failed to add or update a map element","id":2,"type":"BPF_ERR_MAP_UPDATE_ELEM","ret":0,"cpu":14,"file":"./pkg/ebpf/c/tracee.bpf.c","line":3514,"count":1}
...
{"level":"debug","ts":1670529440.1198132,"msg":"Failed to add or update a map element","id":2,"type":"BPF_ERR_MAP_UPDATE_ELEM","ret":0,"cpu":0,"file":"./pkg/ebpf/c/tracee.bpf.c","line":3514,"count":4}
^C
End of events stream
Stats: {EventCount:0 EventsFiltered:0 NetEvCount:0 BPFErrorsCount:15 ErrorCount:0 LostEvCount:0 LostWrCount:0 LostNtCount:0 LostBPFErrCount:0}

@geyslan geyslan mentioned this pull request Nov 11, 2022
1 task
@NDStrahilevitz NDStrahilevitz self-requested a review November 11, 2022 16:19
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch from b170072 to 998bb75 Compare November 11, 2022 17:22
@geyslan geyslan marked this pull request as ready for review November 11, 2022 17:39
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch from 998bb75 to baa6fe9 Compare November 11, 2022 17:45
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch from baa6fe9 to 49035d9 Compare November 14, 2022 11:10
@geyslan

This comment was marked as resolved.

@geyslan
Copy link
Member Author
geyslan commented Nov 14, 2022

I made some declaration reordering and moved errors.go to bpf/kernel/errors.go, not pushed yet.

@NDStrahilevitz @yanivagman @rafaeldtinoco

About naming I would like to know from you what we would pick: BPFError or KernelError?

That choice would imply conformity to:

  • BPF_ERR_MAP_UPDATE_ELEM vs KERNEL_ERR_MAP_UPDATE_ELEM
  • struct bpf_error vs struct kernel_error
  • ...

The BPFError was chosen. Making changes accordingly.

@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch 4 times, most recently from 5b0cccf to 39538a4 Compare November 14, 2022 19:27
Copy link
Collaborator
@yanivagman yanivagman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch 2 times, most recently from 8f6bafb to 129460c Compare November 16, 2022 17:38
@geyslan geyslan changed the title log ebpf errors - 2nd log ebpf errors Nov 16, 2022
@geyslan
Copy link
Member Author
geyslan commented Nov 16, 2022

I had to reduce BPF_MAX_ERR_FUNC_LEN to 40 bytes to pass noncore tests. The error was related to stack size exceeded (512 bytes).

@yanivagman
Copy link
Collaborator
yanivagman commented Nov 20, 2022

One thing that we should be aware of when adding a new bpf error is the following:
If an error happens too frequently (e.g. map_update_element fails because map is full in some hot path) it will cause the errors perf buffer to get filled quickly and may consume a lot of cpu resources when processing these events in userspace.

We should consider adding a mechanism that will turn off the submission and logging of such errors when they happen too frequently (e.g. more than 10 errors per second), and re enable it some time later

@NDStrahilevitz
Copy link
Collaborator

One thing that we should be aware of when adding a new bpf error is the following: If an error happens too frequently (e.g. map_update_element fails because map is full in some hot path) it will cause the errors perf buffer to get filled quickly and may consume a lot of cpu resources when processing these events in userspace.

We should consider adding a mechanism that will turn off the submission and logging of such errors when they happen too frequently (e.g. more than 10 errors per second), and re enable it some time later

Maybe we should have log levels set (possibly with separate perf buffers too?), and then we set a log level from userspace and in ebpf we have some method LOG_EBPF_ERROR(level, msg) which internally will do a log level check and if relevant submit to the perf buffer.
So things like map_update_element failures may fit DEBUG level better than an error you always want.

@geyslan geyslan requested a review from yanivagman December 8, 2022 21:03
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch from 25afc9d to 9b7c8d9 Compare December 8, 2022 21:04
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch 4 times, most recently from 76db381 to d917403 Compare December 12, 2022 20:08
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch 2 times, most recently from 237b85d to 7910130 Compare December 12, 2022 21:16
Copy link
Collaborator 6DB6
@NDStrahilevitz NDStrahilevitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator
@yanivagman yanivagman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rafaeldtinoco
Copy link
Contributor

Finally going through this now (and will merge it right after).

if (likely(counter != NULL)) {
ts_prev = counter->ts; // store previous ts

counter->count += 1;
Copy link
Contributor
@rafaeldtinoco rafaeldtinoco Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're changing a pointee value on the map without an atomic operation. I believe this is racy among different CPUs (as the "errors_count" map isn't a per CPU map). So, you might need either to copy the counter to the stack, increment and update the key with the new stack values (the key update is atomic) OR use atomic[64][fetch]add() for the pointer (not sure if that works for a map pointer, need to check).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, due to the nature of this counter, there is no problem if we loose a few increments (due to 2 cpus copying the values to the stack at the same time and both updating the key values with their stack values). With that, it would be just a question of memcping the pointer to the stack, incrementing and updating the key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not doing that might get us into a situation where count is in diff cachelines, and it gets badly corrupted, not that would impact too much, but, still...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geyslan made a good point that part of the key to this map is the cpu id.
In that case, it is not possible that two different CPUs will update the same counter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it is a good point (haven't payed attention to it). Nice work @geyslan !!

@rafaeldtinoco
Copy link
Contributor

@geyslan, I left a commit at geyslan#7 changing very few things (basically formatting AND a suggestion to the map update concurrency issue). I have not tested, etc... so feel free to merge (as you got LGTM) after you check the suggestions and reasoning. Cheers!

Introduce EBPF error control mechanism - submission via perf buffer.
EBPF errors are aggregated before being submitted and are later
processed and logged by the processBPFErrors() go routine.

This starts the use of bpf error mechanism through some type checking
like BPF_ERR_MAP_UPDATE_ELEM, BPF_ERR_GET_CURRENT_COMM and
BPF_ERR_INIT_CONTEXT.
Due ChanErrors removal aquasecurity#2403, warn about it.
@geyslan geyslan force-pushed the 2174-err-perfbuffer-2 branch from 7910130 to cfd8611 Compare December 15, 2022 10:33
@yanivagman yanivagman merged commit 260f1f9 into aquasecurity:main Dec 15, 2022
@yanivagman yanivagman removed this from the v0.10.0 milestone Jan 2, 2023
@geyslan geyslan deleted the 2174-err-perfbuffer-2 branch May 29, 2023 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] Enable logging from eBPF code
5 participants
0