Implement process tree filter #927

grantseltzer · 2021-08-16T15:49:19Z

Handles #836

Couple of design decisions I made worth discussing

Having support for multiple process trees would have too much of an impact on # of instructions, validator implications, and performance. I instead opted to only support one process tree filter. Hopefully this meets the use-case of CNDR.
I build the process tree after the bpf programs are attached. This removes the issue where procs start/end in between the process tree being built and when sched_proc_fork/sched_proc_exit take over maintenance
The flag is called tree and can be passed like --trace tree!=3 or --trace tree=1001

We must verify that this works on older kernels than my development one however. I'm worried about the possibility that this blows the instruction limit.

yanivagman · 2021-08-17T07:29:27Z

Doesn't work on 4.19:

[yaniv@yaniv-aqua tracee-ebpf]$ sudo ./dist/tracee-ebpf 
[sudo] password for yaniv: 
libbpf: load bpf program failed: Argument list too long
libbpf: Program too large (4795 insns), at most 4096 insns
libbpf: failed to load program 'tracepoint__raw_syscalls__sys_exit'
libbpf: failed to load object '/home/yaniv/src/tracee/tracee-ebpf/dist/tracee.bpf.4_19_201-1-MANJARO.v0_6_0-13-g7bb9807.o'
2021/08/17 10:29:37 error creating Tracee: failed to load BPF object

yanivagman

Keeping track of parent and childs relations complicates the implementation.
It requires to loop until parent is found, which wastes cpu cycles, and defines a max depth of the tree (currently set to 100, but probably we will need to lower that so it can work in older kernels <5.2).

Instead, you can use the implementation I described in #836:

Define a new map in the kernel - process_tree_map with a key that will represent a pid, and a value that will reresent if the process (and its descendants) should be traced or not.
Populate this map during the init phase of tracee, as described above
Update sched_process_fork event as follows: If the parent pid is in process_tree_map, add the child pid as well with the same value as of its parent.
Update should_trace to consult this filter (probably we can use the equality filter semantics here).
Update sched_process_exit to delete the pid entry from the map.

Using this method will not require to perform any loop, and no max depth of the process tree will limit the implementation

yanivagman

Comments inline.
Also, is there any reason to update libbpf files here?

tracee-ebpf/tracee/tracee.bpf.c

tracee-ebpf/main.go

yanivagman · 2021-08-25T05:46:14Z

tracee-ebpf/main.go

+	if err != nil {
+		return fmt.Errorf("invalid PID given to filter: %s", valuesString)
+	}
+	procTreeFilter.PID = uint32(pid)


Any reason why not to support multiple tree filters?

tracee-ebpf/tracee/tracee.bpf.c

tracee-ebpf/tracee/tracee.go

yanivagman · 2021-08-25T06:16:11Z

tracee-ebpf/tracee/proc.go

+	for i := range procs {
+		stat, err := procs[i].Stat()
+		if err != nil {
+			return nil, err


will this retrun an error if a process exited while iterating procs?

I just tested this, and yes you suspect correctly that it will.

Do we want to fail in that case?

tracee-ebpf/tracee/proc.go

yanivagman · 2021-08-25T06:34:29Z

tracee-ebpf/tracee/proc.go

+		if err != nil {
+			return nil, err
+		}
+		processMap[uint32(stat.PPID)] = append(processMap[uint32(stat.PPID)], uint32(stat.PID))


I think that using this method you will only have the pids, and not the tids.
This means that the threads of a process tree won't be included, which is not what we want.

Hm alright, good call out. I'll change this to status.TGID, in bpf context->pid is tgid as we have it now.

Actually tgid is same as pid in userspace context, why would threads not be included if in bpf we're looking at tgid?

In the bpf code you update the map with parent_pid and child_pid (in sched_process_fork) - these are tids (in userspace), not tgids

This is still a problem...

The process tree map in bpf code is using tgid and in userspace we use pid, aren't these the same?

Yes, I see that the bpf code is now using tgid. Should be ok now, however I wonder if it wouldn't have been better to set tids both in userspace and and bpf code for better granularity of the filter. Probably not that important, just a thought.

yanivagman · 2021-08-25T06:37:28Z

tracee-ebpf/tracee/tracee.go

@@ -902,6 +940,33 @@ func (t *Tracee) populateBPFMaps() error {
 		}
 	}

+	err = t.populateProcessTreeMap(t.config.Filter.ProcessTreeFilter.PID)


This map is populated before the bpf programs are attached.
If processes are added/remved after this point, they will not be added/removed from the map.
We should probably do another iteration after the bpf programs are attached, and check if no changes occured to avoid race conditions

tracee-ebpf/tracee/proc.go

yanivagman · 2021-08-30T14:59:03Z

tracee-ebpf/tracee/proc.go

+	for i := range procs {
+		stat, err := procs[i].Stat()
+		if err != nil {
+			return nil, err


Do we want to fail in that case?

yanivagman · 2021-08-30T15:00:08Z

tracee-ebpf/tracee/proc.go

+		if err != nil {
+			return nil, err
+		}
+		processMap[uint32(stat.PID)] = processMap[uint32(stat.PID)]


What does this line do?

This is so that all PIDs are recorded in the map, even if they don't have any child processes.

But this is a self-assignment of processMap[uint32(stat.PID)] to itself, how does it achieves this?

Lol that's actually interesting, I never stopped to think that it shouldn't be valid code. It does work though.

Played with it here:
https://play.golang.org/p/2uHNoyTcnfh

It seems that it doesn't work:
./prog.go:30:2: self-assignment of x[foo] to x[foo]

That's a go vet error, but i'm still not entirely sure why it doesn't complain in this context.

yanivagman · 2021-08-30T15:04:03Z

tracee-ebpf/tracee/proc.go

+		if err != nil {
+			return nil, err
+		}
+		processMap[uint32(stat.PPID)] = append(processMap[uint32(stat.PPID)], uint32(stat.PID))


This is still a problem...

yanivagman · 2021-08-30T15:19:42Z

tracee-ebpf/tracee/proc.go

+// populateProcessTreeFilterMap takes a map of the process tree (k=ppid, v=[]childpid)
+// and a filter specification map (k=pid,v=trace/not) and returns a fully populated map
+// where k=pid, v=trace/not for all pids
+func populateProcessTreeFilterMap(processTree map[uint32][]uint32,


I don't understand why we shuold first gather the process tree representation, and only then initialize the map.
There is a big window for race conditions using this method, and it wastes cycles for processes we don't really need (were not requested by the user tree filter)

The reason we gather the process tree first is so we can determine all of the descendant processes. Is there a way to query proc on a per pid basis to traverse through all its child pids that i'm not aware of?

As in for each PID we can gather its PPID, but I don't see anything like a list of child pids

Yes, there is actually a way to do that. For every process, you can list all of its threads and their children using:
/proc/pid/task/tid/children

Ahhh, thanks for teaching me that. Will have iteration #5 soon :-)

Ok, so here is another suggestion:
For every process 'p' in procfs:

Get its ppid

If ppid = one of the given tree filters pids -> add 'p' to the map, and continue to the next process

If ppid = 1 -> continue to the next process (as we reached the root of the process tree)

Go to step 1 with ppid

This method will find the relevant set of pids which are part of the tree of the pid given by the tree filter, and update the map in (almost) the same time, reducing the window of a possible race condition

@rafaeldtinoco to get the changes of the process tree in runtime is not the problem here, as we have sched_process_fork and sched_process_exit.
The problem here is - how can we initialize a map that contains a "snapshot" of the pids in the process tree whose root is some given pid. The challenge here is to initialize this map and avoid race conditions of exited/new processes in this process tree

@rafaeldtinoco to get the changes of the process tree in runtime is not the problem here, as we have sched_process_fork and sched_process_exit.

I had that in mind when I wrote.

The problem here is - how can we initialize a map that contains a "snapshot" of the pids in the process tree whose root is some given pid. The challenge here is to initialize this map and avoid race conditions of exited/new processes in this process tree

That's why I referred a userland way of following the process tree (by following forks, execs and exits from userland, using netlink, in parallel to BPF program startup).

A taskstats/conector go routine could monitor selected pid trees for forks an and exits and keep updating the map by removing the pid if that exited before the BPF program was loaded (or update the map with other needs)

It would still be racy - as there is no mutex locking mechanism in between kernel BPF program and a userland task - but the fact that pid numbers would keep growing, instead of reusing just exited ones, would protect us until BPF program is loaded, as the race window is small (not bigger than pid # being reused).

Of course this ads complexity to the initial logic.

Another possibility would be to timestamp each entry in the map and have some kind of prune mechanism within userland for leftovers.

@yanivagman A couple of unfortunate issues with your design that I've run into. If we use filepath.WalkDir() we're not able to jump to different points in the process tree, it's iterative. That can be worked around, but the reason it's iterative is because it reads the entire directory upfront, taking a snapshot of it, and presenting the same data race as the current design. Even if not using filepath.WalkDir I run into the same issue with similar implementations that aren't iterative, we can only make the race window smaller if we can't get atomic state of the process tree (i.e. halt the OS 😄 ).

@rafaeldtinoco I will read through that kernel doc you linked, i'm not familiar with that interface.

Another simple idea (again, still has unavoidable race) is to use the current design except re-check the state of the whole process tree on every iteration of traversing PID/PPID relationship (i.e. call gatherEntireProcessTree()) in every iteration of the loop in populateProcessTreeFilterMap(). I may need to explore this.

@grantseltzer its likely we wont use this, as a curiosity though: https://bewareofgeek.livejournal.com/2945.html ... I used something like this in my previous life example =). Its a good example on how to get PROC_EVENT_FORK, PROC_EVENT_EXIT, PROC_EVENT_EXEC from userland (before we could do this with eBPF, for example). It has a netlink socket buffer and its a stream from kernel to userland, but.. yet, its fast.

But again, I think your discussion with Yaniv is in advanced state already.

yanivagman · 2021-08-30T15:24:07Z

tracee-ebpf/tracee/proc.go

+	for pid, _ := range processTree {
+		filterMap[pid] = defaultFilter
+	}


I don't think we need to save this value for every pid in the system.
Instead, we can pass the default filter value in the config map, and use it when we don't find a given pid in the map

Sure this takes up some extra space, but this adds a few more instructions in each should_trace() call to check the default value if the PID isn't in the map. I don't feel that strongly about this, I can if you'd like.

equality_filter_matches (which you use in the bpf code) already takes care of that. You should just set CONFIG_PROC_TREE_FILTER to either FILTER_IN or FILTER_OUT in the config map.

yanivagman · 2021-08-30T15:27:18Z

tracee-ebpf/tracee/proc.go

+// gatherAllDescedentPIDs takes a specific pid, and a map 'pids' which represents
+// a snapshot of the process tree where k = ppid and v = slice of child pids
+// and returns a slice of all the descedent pids
+func gatherAllDescedentPIDs(pid uint32, pids map[uint32][]uint32) []uint32 {


This is where I think you should implement the procfs walk - only for the pids selected in the filter

tracee-ebpf/tracee/tracee.bpf.c

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

… after bpf programs are loaded and attached Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

grantseltzer · 2021-09-01T03:53:27Z

Ok, I'm satisfied with how I have this now. I've moved the bpf map update calls for the process tree map to be closer to when they're collected. This shortens the window for a data race to occur. Importantly, the bpf programs are installed before this map is populated, so the map is being updated before it's even finished being initialized. Another thing is that a data race should never be an issue when the filter is specified for long standing processes (which I expect it always will be). Ultimately there'll always be a tiny amount of risk of losing a data race unless we can halt the kernel, but the impact is negligible.

Another interesting thing I discovered is that libbpfgo's new API that uses unsafe.Pointer is a little unsafe. When we take pointers to booleans, false is typically interpreted as true in bpf code since anything other than 64 zero bits will be true. Just something to think about in the future.

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

grantseltzer · 2021-09-01T04:29:53Z

My latest commit also makes use of FILTER_IN/FILTER_OUT, so there's less time spent updating bpf maps, that should cut down the race window as well.

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

grantseltzer requested review from rafaeldtinoco, yanivagman and mtcherni95 August 16, 2021 15:49

grantseltzer linked an issue Aug 16, 2021 that may be closed by this pull request

Capability to filter out events by process ancestry #836

Closed

grantseltzer added the area/ebpf label Aug 16, 2021

yanivagman reviewed Aug 17, 2021

View reviewed changes

grantseltzer requested a review from yanivagman August 23, 2021 15:52

yanivagman requested changes Aug 25, 2021

View reviewed changes

yanivagman reviewed Aug 29, 2021

View reviewed changes

tracee-ebpf/tracee/proc.go Outdated Show resolved Hide resolved

yanivagman reviewed Aug 30, 2021

View reviewed changes

grantseltzer added 18 commits August 31, 2021 20:08

Implement process tree filtering

00ea921

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

More efficient solution

0a971e1

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Don't change libbpf submodule version

ad83278

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

remove debug print lines and update to latest usage of libbpfgo

32a11cd

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

remove code from bad rebase

7f57a2e

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Populate entire process tree instead of just filters

7e9bbb0

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Remove old design code

c299b8f

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

fix logic for checking equality of process tree filter

e9ed02e

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Check for no arguments being passed

ea1fc36

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Use tgid instead of tid in kernel space

9772337

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

support multiple process tree filters

bec7efb

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Fix logic for default filter behavior

e490e8c

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Remove two pass process, not needed anymore

33fa109

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Fix tgid/pid in sched_proccess_exit, populate process tree filter map…

76fec51

… after bpf programs are loaded and attached Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Move BPF map update to be closer to when the filter is calculated

6eddc57

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

remove extra line

a9769f3

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

ensure safety of setting boolean in bpfmap with unsafe.Pointer

bf10705

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

fix spacing maybe

e8b3a00

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

grantseltzer added 4 commits August 31, 2021 23:55

update help text

980e11c

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Readd bpf code after bad rebase

0b7e3e1

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

ignore when proc exits while iterating over procfs

7f13a47

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

Make use of FILTER_IN/FILTER_OUT, no need to populate unrelated pids

94efe42

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

yanivagman and others added 5 commits September 1, 2021 16:29

Update from leaf to root

f62147f

fmt and cosmetics

9c05c4e

trace root of tree

1f11837

find ppid inline

fd52e24

parse ppid out of stat file directly

83bd0c7

Signed-off-by: grantseltzer <grantseltzer@gmail.com>

yanivagman approved these changes Sep 1, 2021

View reviewed changes

grantseltzer merged commit 9d588c9 into aquasecurity:main Sep 1, 2021

Implement process tree filter #927

Implement process tree filter #927

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment