By Artem Dinaburg
eBPF (extended Berkeley Packet Filter) has emerged as the de facto Linux standard for security monitoring and endpoint observability. It is used by technologies such as BPFTrace, Cilium, Pixie, Sysdig, and Falco due to its low overhead and its versatility.
There is, however, a dark (but open) secret: eBPF was never intended for security monitoring. It is first and foremost a networking and debugging tool. As Brendan Gregg observed:
eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float.
But eBPF is being used for security monitoring anyway, and developers may not be aware of the common pitfalls and under-reported problems that come with this use case. In this post, we cover some of these problems and provide workarounds. However, some challenges with using eBPF for security monitoring are inherent to the platform and cannot be easily addressed.
Pitfall #1: eBPF probes are not invoked
In theory, the kernel should never fail to fire eBPF probes. In practice, it does. Sometimes, although very rarely, the kernel will not fire eBPF probes when user code expects to see them. This behavior is not explicitly documented or acknowledged, but you can find hints of it in bug reports for eBPF tooling.
This bug report provides valuable insight. First, the issues involved are rare and difficult to debug. Second, the kernel may be technically correct, but the observed behavior on the user side is missing events, even if the proximate behavior was different (e.g., too many probes). Comments on the bug report present two theories for why events are missing:
- First, there is a set limit on the number of kRetProbes that the kernel can have active at once. As of kernel 6.4.5, the default limit is 4,096. Attempts to create more kRetProbes will fail, resulting in a missed event.
- Second, the callback logic for a kProbe and a kRetProbe is slightly different, which means that sometimes a kProbe will not see a matching kRetProbe, resulting in a missed event.
More of these issues are likely lurking in the kernel, either as documented edge cases or surprise emergent effects of unrelated design decisions. eBPF is not a security monitoring mechanism, so there is not a guarantee that probes will fire as expected.
Workarounds
None. The callback logic and value for the maximum number of kRetProbes are hard-coded into the kernel. While one can manually edit and rebuild the kernel source, doing so is not advisable or feasible for most scenarios. Any tools relying on eBPF must be prepared for an occasional missing callback.
Pitfall #2: Data is truncated due to space constraints
An eBPF program’s stack space is limited to 512 bytes. When writing eBPF code, developers need to be particularly cautious about how much scratch data they use and the depth of their call stacks. This limit affects both the amount and kind of data that can be processed using eBPF code. For instance, 512 bytes is less than the longest permitted file path length, which is 4,096 bytes.
Workarounds
There are multiple options to get more scratch space, but they all involve cheating. Thanks to the `bpf_map_lookup_elem` helper, it’s possible to use a map’s memory directly. Directly using maps as storage effectively functions as `malloc`, but for eBPF code. A plausible implementation is a per-CPU array with a single key, whose value size corresponds to our allocation needs:
```c
u64 first_key = 0;
// implemented with bpf_map_lookup_elem
u8 *scratch_buffer = per_cpu_map.lookup(&first_key);
```
However, how do we send this data back to our user-mode code? A naive approach is to use even more maps, but this fails with variable-sized objects like paths, and it also wastes memory. Maps can be very expensive in terms of memory use because data must be replicated per CPU to ensure integrity. Unfortunately, per-CPU maps allocate memory based on the number of possible hot-swappable CPUs. This number can easily be huge—on VMware Fusion, it defaults to 128, so a single map entry wastes 127 times as much space as it uses.
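As a concrete sketch of the scratch-map trick, here is what the map definition might look like using libbpf-style conventions. This will not compile outside an eBPF build environment (it needs `vmlinux.h`/`bpf_helpers.h` and an eBPF toolchain), and the map and function names are illustrative, not from any particular tool:

```c
/* Sketch only: a single-slot per-CPU array used as scratch memory. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);       /* one slot, reused on every call */
    __type(key, u32);
    __type(value, u8[8192]);      /* far beyond the 512-byte stack */
} scratch SEC(".maps");

SEC("kprobe/some_function")
int probe(struct pt_regs *ctx) {
    u32 key = 0;
    u8 *buf = bpf_map_lookup_elem(&scratch, &key);
    if (!buf)
        return 0;  /* the verifier requires this NULL check */
    /* ... use buf as scratch space ... */
    return 0;
}
```

Because the array is per-CPU and eBPF programs are not preempted by other eBPF programs on the same CPU, the single slot can be safely reused across invocations.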
Another approach is to stream data through the perf ring buffer. The linuxevents library uses this method to handle variable-length paths. The following is an example pseudocode implementation of this approach:
```c
u64 first_key = 0;
u8 *scratch_space = per_cpu_array.lookup(&first_key);

for (const auto &component_ptr : path.components()) {
  bpf_probe_read_str(scratch_space, component_ptr, scratch_space_size);
  perf_submit(scratch_space);
}
```
Streaming data through the `perf` ring buffer significantly increases the effective size of each component and also improves space efficiency, albeit at the expense of additional data reconstruction work in user mode. To handle edge cases like untriggered probes or lost/overwritten data, a recovery method must be implemented after data transmission. Unfortunately, `perf` buffers are allocated in a similar per-CPU fashion to per-CPU maps. On newer kernels, the BPF ring buffer, which is shared across CPUs, can be used instead to avoid that issue.
Pitfall #3: Limited instruction count
An eBPF program can have only 4,096 instructions, and reusing code (e.g., by defining a function) is not possible. Until recently, loops were not supported (or they had to be manually unrolled). While eBPF allows a maximum of 1 million instructions to be executed at runtime, the program can still be only 4,096 instructions long.
Workarounds
Rebuild your programs to take advantage of bounded loops (i.e., loops where the iteration count can be statically determined). These loops are now supported, and they save precious program space compared to unrolled loops. Another workaround is to split logic across multiple programs that tail call each other, which they can do up to 32 times before execution is interrupted. A drawback of this approach is that program state is lost at each transition. To keep state across tail calls, consider storing data in an eBPF map accessible to all of the programs in the chain.
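The tail-call-plus-map pattern might look like the following libbpf-style sketch. As before, this will not compile outside an eBPF build environment, and the names (`progs`, `state_map`, `struct state`, `stage1`) are illustrative assumptions:

```c
/* Sketch only: tail calls via a BPF_MAP_TYPE_PROG_ARRAY, with shared
 * state kept in a per-CPU array so it survives the transition. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 2);
    __type(key, u32);
    __type(value, u32);
} progs SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct state);   /* hypothetical shared state */
} state_map SEC(".maps");

SEC("kprobe/some_function")
int stage1(struct pt_regs *ctx) {
    u32 key = 0;
    struct state *st = bpf_map_lookup_elem(&state_map, &key);
    if (!st)
        return 0;
    /* ... fill in st, then hand off to the next program ... */
    bpf_tail_call(ctx, &progs, 1); /* index of stage 2 in progs */
    return 0; /* reached only if the tail call fails */
}
```

Note that `bpf_tail_call` does not return on success, so any code after it runs only when the call fails (e.g., if the program array slot is empty).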
Pitfall #4: Time-of-check to time-of-use issues
An eBPF program can and will run concurrently on different CPU cores. This is true even for kernel code. Since there is no way to call kernel synchronization functions or to reliably acquire locks from eBPF, data races and time-of-check to time-of-use issues are a serious concern.
Workarounds
The only workaround is to carefully choose the event attach point, depending on the program. For example, eBPF commonly needs to work with functions that accept user data. In this situation, a good attach point is right after user data has been read into kernel mode.
When dealing with kernel code where synchronization is involved, you may not be able to mitigate time-of-check to time-of-use issues. As an example, the `dentry` structure that backs files is often modified under lock by the kernel, and it is impossible to acquire these locks from an eBPF probe. Often the only indication that something is wrong is a bad return code from an API like `bpf_probe_read_user`. Make sure to handle such errors in a way that does not make the event data completely unusable. For example, if you are streaming data through `perf` in multiple packets, insert an error packet that notifies clients of missing data so that they can realign themselves to the event stream without causing corruption.
Pitfall #5: Event overload
Because eBPF lacks concurrency primitives and an eBPF probe cannot block the event producer, an attach point can be easily overwhelmed with events. This can lead to the following issues:
- Missed events, as the kernel stops calling the probe
- Data loss due to the lack of storage space for new data
- Data loss due to the complete overwriting of older but not yet consumed data by newer information
- Data corruption from partial overwrites or complex data formats, disrupting normal program operation
These data loss and corruption scenarios depend on the number of probes and events that are adding items to the event stream and on the extent of system activity. For instance, a Docker container startup sequence or a deployment script can trigger a surprisingly large number of events. Developers should choose the events to monitor carefully and should avoid repetition and constructs that make it harder to recover from data loss.
Workarounds
The user-mode helper should treat all data coming from eBPF probes as untrusted. This includes data from your own eBPF probes, which is also susceptible to accidental corruption. There should also be some application-level mechanism to detect missing or corrupted data.
Pitfall #6: Page faults
Memory that has not been accessed recently may be paged out to disk—be it a swap file, a backing file, or a more esoteric location. Normally, when this memory is needed, the kernel will issue a page fault, load the relevant content, and continue execution. For various reasons, eBPF runs with page faults disabled—if memory is paged out, it cannot be accessed. This is bad news for a security monitoring tool.
Workarounds
The only workaround is to hook right after a buffer is used and hope it does not get paged out before the probe reads it. This cannot be strictly guaranteed since there are no concurrency primitives, but the way the hook is implemented can increase the likelihood of success.
Consider the following example:
```c
int syscall_name(const char *user_mode_ptr) {
  function1();
  function2(user_mode_ptr);
  function3();
  return 0;
}
```
To make sure that `user_mode_ptr` can be accessed, this code first hooks the entry of `syscall_name` and saves all of the pointer parameters in a map. It then searches for a place where `user_mode_ptr` is almost certainly accessible (i.e., anything past the call to `function2`) and sets an attach point there to read the data. The following are some options for the attach point:
- On `function2` exit
- On `function3` entry
- On `function3` exit
- On `syscall_name` exit
You may be wondering why we don’t just hook `function2` directly. While this can work occasionally, it is normally a bad idea:

- `function2` is often called outside of the context you are interested in (i.e., outside of `syscall_name`).
- `function2` may not have the same signature across kernel revisions. If we just use the function as an opaque breakpoint, signature changes do not affect our probe.
Also note that, at times, the parameter changes during a system call, and we need to read it before the data is gone. For example, the `execve` system call replaces the entire process memory, erasing all initial data before the call completes.
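The entry-plus-later-attach-point pattern described above might be sketched as follows in libbpf-style eBPF C. This will not compile outside an eBPF build environment, and the map and probe names are illustrative assumptions:

```c
/* Sketch only: save the user pointer at syscall entry, keyed by
 * thread, and read it at a later point where it is likely paged in. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);              /* pid_tgid */
    __type(value, const char *);   /* saved user_mode_ptr */
} saved_args SEC(".maps");

SEC("kprobe/syscall_name")
int on_entry(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    const char *ptr = (const char *)PT_REGS_PARM1(ctx);
    bpf_map_update_elem(&saved_args, &id, &ptr, BPF_ANY);
    return 0;
}

SEC("kretprobe/function2")  /* any point past function2 works */
int on_later_point(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    const char **pptr = bpf_map_lookup_elem(&saved_args, &id);
    if (pptr) {
        /* bpf_probe_read_user_str(...) here; check its return code,
         * since the memory may still be paged out */
        bpf_map_delete_elem(&saved_args, &id);
    }
    return 0;
}
```

Deleting the map entry in the second probe keeps the hash map from filling with stale entries when the later attach point fires.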
Again, developers should assume that some memory may be unreadable by the eBPF probe and develop accordingly.
Embracing benefits, addressing limitations
eBPF is a powerful tool for Linux observability and monitoring, but it was not designed for security and comes with inherent limitations. Developers need to be aware of pitfalls like probe unreliability, data truncation, instruction limits, concurrency issues, event overload, and page faults. Workarounds exist, but they are imperfect and often add complexity.
The bottom line is that while eBPF enables exciting new capabilities, it is not a silver bullet. Software using eBPF for security monitoring must be built to gracefully handle missing data and error conditions. Robustness needs to be a top priority.
With care and creativity, eBPF can still be used to build next-generation security tools. But it requires acknowledging and working around eBPF’s constraints, not ignoring them. As with any technology, the most effective security monitoring solutions will embrace eBPF while being aware of how it can fail.