8000 Implement process tree filter by grantseltzer · Pull Request #927 · aquasecurity/tracee · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Implement process tree filter #927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Sep 1, 2021
Merged

Implement process tree filter #927

merged 27 commits into from
Sep 1, 2021

Conversation

grantseltzer
Copy link
Contributor
@grantseltzer grantseltzer commented Aug 16, 2021

Handles #836

Couple of design decisions I made worth discussing

  • Having support for multiple process trees would have too much of an impact on # of instructions, validator implications, and performance. I instead opted to only support one process tree filter. Hopefully this meets the use-case of CNDR.
  • I build the process tree after the bpf programs are attached. This removes the issue where procs start/end in between the process tree being built and when sched_proc_fork/sched_proc_exit take over maintenance
  • The flag is called tree and can be passed like --trace tree!=3 or --trace tree=1001

We must verify that this works on older kernels than my development one however. I'm worried about the possibility that this blows the instruction limit.

@yanivagman
Copy link
Collaborator

Doesn't work on 4.19:

[yaniv@yaniv-aqua tracee-ebpf]$ sudo ./dist/tracee-ebpf 
[sudo] password for yaniv: 
libbpf: load bpf program failed: Argument list too long
libbpf: Program too large (4795 insns), at most 4096 insns
libbpf: failed to load program 'tracepoint__raw_syscalls__sys_exit'
libbpf: failed to load object '/home/yaniv/src/tracee/tracee-ebpf/dist/tracee.bpf.4_19_201-1-MANJARO.v0_6_0-13-g7bb9807.o'
2021/08/17 10:29:37 error creating Tracee: failed to load BPF object

Copy link
Collaborator
@yanivagman yanivagman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping track of parent and childs relations complicates the implementation.
It requires to loop until parent is found, which wastes cpu cycles, and defines a max depth of the tree (currently set to 100, but probably we will need to lower that so it can work in older kernels <5.2).

Instead, you can use the implementation I described in #836:

  1. Define a new map in the kernel - process_tree_map with a key that will represent a pid, and a value that will reresent if the process (and its descendants) should be traced or not.
  2. Populate this map during the init phase of tracee, as described above
  3. Update sched_process_fork event as follows: If the parent pid is in process_tree_map, add the child pid as well with the same value as of its parent.
  4. Update should_trace to consult this filter (probably we can use the equality filter semantics here).
  5. Update sched_process_exit to delete the pid entry from the map.

Using this method will not require to perform any loop, and no max depth of the process tree will limit the implementation

Copy link
Collaborator
@yanivagman yanivagman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments inline.
Also, is there any reason to update libbpf files here?

if err != nil {
return fmt.Errorf("invalid PID given to filter: %s", valuesString)
}
procTreeFilter.PID = uint32(pid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why not to support multiple tree filters?

for i := range procs {
stat, err := procs[i].Stat()
if err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this retrun an error if a process exited while iterating procs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested this, and yes you suspect correctly that it will.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to fail in that case?

if err != nil {
return nil, err
}
processMap[uint32(stat.PPID)] = append(processMap[uint32(stat.PPID)], uint32(stat.PID))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that using this method you will only have the pids, and not the tids.
This means that the threads of a process tree won't be included, which is not what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm alright, good call out. I'll change this to status.TGID, in bpf context->pid is tgid as we have it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually tgid is same as pid in userspace context, why would threads not be included if in bpf we're looking at tgid?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the bpf code you update the map with parent_pid and child_pid (in sched_process_fork) - these are tids (in userspace), not tgids

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a problem...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process tree map in bpf code is using tgid and in userspace we use pid, aren't these the same?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see that the bpf code is now using tgid. Should be ok now, however I wonder if it wouldn't have been better to set tids both in userspace and and bpf code for better granularity of the filter. Probably not that important, just a thought.

@@ -902,6 +940,33 @@ func (t *Tracee) populateBPFMaps() error {
}
}

err = t.populateProcessTreeMap(t.config.Filter.ProcessTreeFilter.PID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is populated before the bpf programs are attached.
If processes are added/remved after this point, they will not be added/removed from the map.
We should probably do another iteration after the bpf programs are attached, and check if no changes occured to avoid race conditions

for i := range procs {
stat, err := procs[i].Stat()
if err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to fail in that case?

if err != nil {
return nil, err
}
processMap[uint32(stat.PID)] = processMap[uint32(stat.PID)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this line do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so that all PIDs are recorded in the map, even if they don't have any child processes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is a self-assignment of processMap[uint32(stat.PID)] to itself, how does it achieves this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol that's actually interesting, I never stopped to think that it shouldn't be valid code. It does work though.

Played with it here:
https://play.golang.org/p/2uHNoyTcnfh

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it doesn't work:
./prog.go:30:2: self-assignment of x[foo] to x[foo]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a go vet error, but i'm still not entirely sure why it doesn't complain in this context.

if err != nil {
return nil, err
}
processMap[uint32(stat.PPID)] = append(processMap[uint32(stat.PPID)], uint32(stat.PID))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a problem...

// populateProcessTreeFilterMap takes a map of the process tree (k=ppid, v=[]childpid)
// and a filter specification map (k=pid,v=trace/not) and returns a fully populated map
// where k=pid, v=trace/not for all pids
func populateProcessTreeFilterMap(processTree map[uint32][]uint32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we shuold first gather the process tree representation, and only then initialize the map.
There is a big window for race conditions using this method, and it wastes cycles for processes we don't really need (were not requested by the user tree filter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we gather the process tree first is so we can determine all of the descendant processes. Is there a way to query proc on a per pid basis to traverse through all its child pids that i'm not aware of?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in for each PID we can gather its PPID, but I don't see anything like a list of child pids

Copy link
Collaborator
@yanivagman yanivagman Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is actually a way to do that. For every process, you can list all of its threads and their children using:
/proc/pid/task/tid/children

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, thanks for teaching me that. Will have iteration #5 soon :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so here is another suggestion:
For every process 'p' in procfs:

  1. Get its ppid
  2. If ppid = one of the given tree filters pids -> add 'p' to the map, and continue to the next process
  3. If ppid = 1 -> continue to the next process (as we reached the root of the process tree)
  4. Go to step 1 with ppid

This method will find the relevant set of pids which are part of the tree of the pid given by the tree filter, and update the map in (almost) the same time, reducing the window of a possible race condition

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaeldtinoco to get the changes of the process tree in runtime is not the problem here, as we have sched_process_fork and sched_process_exit.
The problem here is - how can we initialize a map that contains a "snapshot" of the pids in the process tree whose root is some given pid. The challenge here is to initialize this map and avoid race conditions of exited/new processes in this process tree

Copy link
Contributor
@rafaeldtinoco rafaeldtinoco Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaeldtinoco to get the changes of the process tree in runtime is not the problem here, as we have sched_process_fork and sched_process_exit.

I had that in mind when I wrote.

The problem here is - how can we initialize a map that contains a "snapshot" of the pids in the process tree whose root is some given pid. The challenge here is to initialize this map and avoid race conditions of exited/new processes in this process tree

That's why I referred a userland way of following the process tree (by following forks, execs and exits from userland, using netlink, in parallel to BPF program startup).

A taskstats/conector go routine could monitor selected pid trees for forks an and exits and keep updating the map by removing the pid if that exited before the BPF program was loaded (or update the map with other needs)

It would still be racy - as there is no mutex locking mechanism in between kernel BPF program and a userland task - but the fact that pid numbers would keep growing, instead of reusing just exited ones, would protect us until BPF program is loaded, as the race window is small (not bigger than pid # being reused).

Of course this ads complexity to the initial logic.

Another possibility would be to timestamp each entry in the map and have some kind of prune mechanism within userland for leftovers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanivagman A couple of unfortunate issues with your design that I've run into. If we use filepath.WalkDir() we're not able to jump to different points in the process tree, it's iterative. That can be worked around, but the reason it's iterative is because it reads the entire directory upfront, taking a snapshot of it, and presenting the same data race as the current design. Even if not using filepath.WalkDir I run into the same issue with similar implementations that aren't iterative, we can only make the race window smaller if we can't get atomic state of the process tree (i.e. halt the OS 😄 ).

@rafaeldtinoco I will read through that kernel doc you linked, i'm not familiar with that interface.

Another simple idea (again, still has unavoidable race) is to use the current design except re-check the state of the whole process tree on every iteration of traversing PID/PPID relationship (i.e. call gatherEntireProcessTree()) in every iteration of the loop in populateProcessTreeFilterMap(). I may need to explore this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grantseltzer its likely we wont use this, as a curiosity though: https://bewareofgeek.livejournal.com/2945.html ... I used something like this in my previous life example =). Its a good example on how to get PROC_EVENT_FORK, PROC_EVENT_EXIT, PROC_EVENT_EXEC from userland (before we could do this with eBPF, for example). It has a netlink socket buffer and its a stream from kernel to userland, but.. yet, its fast.

But again, I think your discussion with Yaniv is in advanced state already.

Comment on lines 50 to 52
for pid, _ := range processTree {
filterMap[pid] = defaultFilter
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to save this value for every pid in the system.
Instead, we can pass the default filter value in the config map, and use it when we don't find a given pid in the map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure this takes up some extra space, but this adds a few more instructions in each should_trace() call to check the default value if the PID isn't in the map. I don't feel that strongly about this, I can if you'd like.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equality_filter_matches (which you use in the bpf code) already takes care of that. You should just set CONFIG_PROC_TREE_FILTER to either FILTER_IN or FILTER_OUT in the config map.

// gatherAllDescedentPIDs takes a specific pid, and a map 'pids' which represents
// a snapshot of the process tree where k = ppid and v = slice of child pids
// and returns a slice of all the descedent pids
func gatherAllDescedentPIDs(pid uint32, pids map[uint32][]uint32) []uint32 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I think you should implement the procfs walk - only for the pids selected in the filter

Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
… after bpf programs are loaded and attached

Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
@grantseltzer
Copy link
Contributor Author

Ok, I'm satisfied with how I have this now. I've moved the bpf map update calls for the process tree map to be closer to when they're collected. This shortens the window for a data race to occur. Importantly, the bpf programs are installed before this map is populated, so the map is being updated before it's even finished being initialized. Another thing is that a data race should never be an issue when the filter is specified for long standing processes (which I expect it always will be). Ultimately there'll always be a tiny amount of risk of losing a data race unless we can halt the kernel, but the impact is negligible.

Another interesting thing I discovered is that libbpfgo's new API that uses unsafe.Pointer is a little unsafe. When we take pointers to booleans, false is typically interpreted as true in bpf code since anything other than 64 zero bits will be true. Just something to think about in the future.

Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
Signed-off-by: grantseltzer <grantseltzer@gmail.com>
@grantseltzer
Copy link
Contributor Author

My latest commit also makes use of FILTER_IN/FILTER_OUT, so there's less time spent updating bpf maps, that should cut down the race window as well.

@grantseltzer grantseltzer merged commit 9d588c9 into aquasecurity:main Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Capability to filter out events by process ancestry
3 participants
0