Christiano Haesbaert

Signaling from within: how eBPF interacts with signals

This article explores some of the semantics of UNIX signals when generated from an eBPF program.

Background

Signals have been around since the UNIX First Edition in 1971, and while their semantics and system calls have changed throughout the years, their uses and applications have remained largely the same. Usually, when we talk about signal semantics, we're talking about what userland can observe and interact with. After all, we mostly generate and handle signals to/from userland processes.

In this publication, we will explore some of the semantics of signals generated inside the kernel by an eBPF program. More specifically, we'll identify what kind of effects and guarantees we observed after the handling of such signals. You can find more information about eBPF in this article.

Motivation

In Elastic Defend for Containers, we utilize eBPF in Linux Security Module (LSM) hooks that restrict access to system resources. Using LSM is the preferred way to enforce this kind of restriction, as the eBPF program can return an error like EPERM (an operation was attempted, but without proper privileges), which is propagated up to the system call return value.
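To illustrate, a minimal sketch of this approach could look like the following. This is our own example, not Defend for Containers code: it assumes a libbpf-style build against vmlinux.h on a kernel with BPF LSM support, and the program name is hypothetical.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define EPERM 1 /* avoid pulling userspace errno.h into a vmlinux.h build */

/* Deny every file open; the error is propagated up to the openat(2) return value */
SEC("lsm/file_open")
int BPF_PROG(deny_file_open, struct file *file)
{
	return -EPERM;
}

char LICENSE[] SEC("license") = "GPL";

A real policy would of course inspect file before deciding, but the mechanism is the same: the LSM hook's return value becomes the system call's error.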

The problem with using eBPF+LSM in this way is that support is relatively new and, for the most part, only applies to AMD64. Therefore, we wanted to explore using the eBPF helper bpf_send_signal() where necessary, such as on older kernels or other architectures. Instead of failing the system call with EPERM, bpf_send_signal() would be used to send a SIGKILL to the current process and terminate it, which is arguably more dramatic but still reasonable given the limitations.

Generally, we aim to answer these questions:

  • What side effects are observed (if any) after the program receives a SIGKILL?
  • Which of the side effects (if any) result from the signal subsystem's design versus its implementation?
  • If the kernel code changes in the future, how will that impact these side effects?

Scenario: blocking openat(2)

Imagine we would like to prevent certain processes from opening files and, for the sake of simplicity, we would like to prevent these processes from using an openat(2) system call.

If LSM were available, we would hook our eBPF program in the LSM hook security_file_open(), return EPERM, and then openat(2) would fail gracefully. Because LSM is not available, we’ll instead generate a SIGKILL, but first, we need to figure out a place to hook our eBPF program in the kernel.

We have a few options: we can use a static tracepoint like syscalls:sys_enter_openat2, or we can use kprobes and run our eBPF program from a kernel function of our choice. Obvious candidates would be vfs_open, do_sys_openat2 (happens a little earlier), or __x64_sys_openat (even earlier, but machine-dependent). We can test it with bpftrace:

bpftrace --unsafe -e 'kprobe:vfs_open
  /str(((struct path *)arg0)->dentry->d_name.name) == "__noopen"/
  { signal("SIGKILL") }'

# In another tty we can put it to the test
$ strace /bin/cat /tmp/__noopen
...
openat(AT_FDCWD, "/tmp/__noopen", O_RDONLY) = ?
+++ killed by SIGKILL +++
Killed

We can see that cat(1) is terminated with SIGKILL the moment it attempts to open the file. At first glance, this appears to work correctly, but it may be premature to declare victory.

It's important to note that the signal is not being generated by an external process, but from the context of the cat(1) process doing the system call to itself. It is performing the in-kernel equivalent of kill(getpid(), SIGKILL), a process signaling itself.
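A userland analogue makes the timing visible. In this small example of ours, the process SIGKILLs itself: the signal is delivered on the way back from the kill(2) system call, so the next line never executes.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	kill(getpid(), SIGKILL);	/* queued for ourselves... */
	puts("never reached");		/* ...delivered before kill(2) returns */

	return (0);
}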

The only thing we’ve proven is that the program is indeed terminated, but this opens more questions:

  • Did we block openat(2) or not?
  • Does the outcome change if we successfully block openat(2)?
  • Are there more observable side effects?

If we conduct the same experiment on the same path, but with a nonexistent file and the O_CREAT flag passed to openat(2), is the file created? Is the application still terminated? Let’s see what happens:

$ rm /tmp/__noopen 
$ strace /bin/touch /tmp/__noopen
...
openat(AT_FDCWD, "/tmp/__noopen", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = ?
+++ killed by SIGKILL +++
Killed

In this case, we are still terminated by SIGKILL. But, if we examine the filesystem, there is now an empty file created by the offending program! We can conclude that openat(2) did somehow execute because the file creation was observed.

Kernel handling of a SIGKILL

Signals can’t be handled online; instead, they must be post-processed at safe points. By online we mean: if I'm doing a system call and a SIGKILL arrives, I cannot just cease to exist. Signals must be checked at safe points, and in most UNIXes this is done before returning to userland.

The check for pending signals is done after running the system call, in exit_to_user_mode_loop(). If TIF_SIGPENDING is set in the current task structure, the process branches into the signal handling code. When SIGKILL (a fatal signal) is pending, the process branches into do_group_exit(), which never returns, resulting in the end of the process.

Why post-process signals

Signals must be post-processed and handled at safe points, otherwise the kernel would have to account for the process involuntarily exiting due to a fatal signal. We can conduct a thought experiment and imagine an implementation that attempts to process signals the moment they arrive. This could be implemented by interrupting the running process and forcing it to exit from the interrupt context, for example:

  • Process A running on cpu0 performs a system call
  • Process B running on cpu1 sends a SIGKILL to process A
  • An IPI would be sent from cpu1 to cpu0
  • cpu0 would trap into an interrupt frame, realize it is here due to a signal being sent, and perform an exit of the current process

Thankfully, this is not the case: you cannot exit from an interrupt context. Furthermore, this couldn’t be implemented without introducing significant changes to resource management. When a process exits, it must release any resources it holds – like locks, reference counts, or any other kind of mutable data that may be influenced by the exiting process.

We can draw a parallel with kernel preemption, as Linux is highly preemptive when configured with CONFIG_PREEMPT_FULL. This allows the scheduler to shelve the running process while it is in kernel space and run other processes. From the point of view of the process being preempted, this is an involuntary context switch, as it did not voluntarily release the CPU. This is orthogonal to a preemptive userland, where the scheduler preempts a process running in user mode. Historically, UNIX systems did not employ a preemptive kernel; the strategy to maintain low latency relied solely on fast (short) system calls and interrupt priorities.

Programming with preemption is harder because the kernel programmer must always consider the impact of being preempted and judge when to disable preemption. Failure to disable preemption at the right time (for example, while holding a spinlock) could result in another process spinning indefinitely on a lock held by a preempted process.

If we allowed a process to exit involuntarily from a trap frame, it would be a bit like preemption, but much harder – if not impossible. The kernel programmer would now have to always consider "what happens if my process involuntarily exits here?", and this would likely involve having to register callbacks to release resources on exit.

Hopefully, it's now clear why signals can’t be handled online. Signals in Linux, as in other systems, are processed just before returning to userland.
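We can mimic this design in userland. The sketch below is our own illustration (using SIGTERM, since SIGKILL can't be blocked): it defers a signal with sigprocmask(2) and only checks for it at a safe point, much like the kernel checks pending signals on the way back to userland.

#include <signal.h>
#include <stdio.h>

int main(void)
{
	sigset_t set, pending;

	/* Defer SIGTERM: it may arrive, but won't be acted upon yet */
	sigemptyset(&set);
	sigaddset(&set, SIGTERM);
	sigprocmask(SIG_BLOCK, &set, NULL);

	raise(SIGTERM);			/* "arrives" mid-operation */
	puts("finishing the critical work...");

	/* The safe point: check what is pending and act on it */
	sigpending(&pending);
	if (sigismember(&pending, SIGTERM))
		puts("SIGTERM pending, exiting now");

	return (0);
}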

Posting a SIGKILL from eBPF

Let us follow the lifecycle of a SIGKILL originating from an eBPF program until the process is terminated.

When an eBPF program calls the special helper bpf_send_signal(SIGKILL), we end up in bpf_send_signal_common(SIGKILL, PIDTYPE_TGID). PIDTYPE_TGID is the "thread group id", and it specifies that any task (meaning any pthread) of the current process may accept the signal. eBPF also provides bpf_send_signal_thread(), which sends the signal only to the current task by specifying PIDTYPE_PID instead.
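For reference, this is roughly what the bpftrace one-liners in this article boil down to, minus the filename predicate. The sketch is ours (a libbpf-style build against vmlinux.h is assumed; the program name is hypothetical):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* SIGKILL whichever process reaches vfs_open (9 == SIGKILL) */
SEC("kprobe/vfs_open")
int kill_current(struct pt_regs *ctx)
{
	bpf_send_signal(9);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";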

bpf_send_signal_common() has to be used with caution because it must be able to generate a signal from any point in the kernel where you can attach an eBPF program; this is tricky work that has resulted in some past bugs, like this deadlock. This is an interesting imposition created by eBPF: before it, signals generated from the kernel originated only from controlled points.

Most of the heavy lifting of posting a signal is done in __send_signal_locked() and complete_signal() and we get there through the following stack:

complete_signal()          ^
__send_signal_locked()     |
send_signal_locked()       |
do_send_sig_info()         |
group_send_sig_info()      |
bpf_send_signal()          |


static int __send_signal_locked(int sig, 
    struct kernel_siginfo *info, struct task_struct *t, 
    enum pid_type type, bool force)

In our case, in __send_signal_locked(): sig is SIGKILL, info is SEND_SIG_PRIV, t is the current task (the running thread), type is PIDTYPE_TGID, and force is true. force is always set when info is SEND_SIG_PRIV, meaning this is a signal originating from the kernel, not from some userland program.

__send_signal_locked() will register SIGKILL as pending inside a signal structure of the current task (our t); since we're using PIDTYPE_TGID, this is the process-wide pending set, shared by all tasks (pthreads) of the process. Control is then passed to complete_signal().

SIGKILL is a bit special in complete_signal(): since it is a fatal signal, the pending signal bit that was set in the shared structure of the process is replicated into the per-task pending set of every thread. This means a SIGKILL is marked as pending for every pthread of the current process.

complete_signal() then wakes up all threads via signal_wake_up()/signal_wake_up_state() so that they can be terminated. Each thread must terminate on its own; the wakeup is the kernel politely asking each thread to “please exit next time instead of returning to userland”.

In the signal_wake_up() stack, the TIF_SIGPENDING flag is set, warning the task to check its pending signals. The thread might be in userland at the time we try to wake it up; worse, it might be looping forever, in which case it would not enter the kernel until the scheduler decides to preempt it or an interrupt fires. This case is avoided by forcing the thread to enter the kernel via kick_process(), which sends an IPI to the remote CPU, forcing it to trap into the kernel; the thread will then try to return to userland, check TIF_SIGPENDING, find a SIGKILL, and terminate.

Voluntary signal checking

While signals are only processed when returning to userland, checking if those signals are pending can be done anywhere. tmpfs, ext4, xfs, and many other filesystems will check if a fatal signal is pending before starting a write. If a fatal signal is pending, they will return an error to the caller, unwinding the system call stack up until the point of returning to userland, which then terminates the program as we've seen before. The voluntary check for tmpfs and ext4 write can be seen here.
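The pattern looks roughly like this. Note this is a simplified sketch in kernel style, not verbatim kernel code; see generic_perform_write() for the real check:

/* Inside a filesystem write path, before doing any work: */
if (fatal_signal_pending(current)) {
	/*
	 * A SIGKILL is pending: don't start the write, just unwind.
	 * The -EINTR propagates up the system call stack, and the
	 * process is terminated on its way back to userland.
	 */
	return -EINTR;
}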

We can now reason about what happens in tmpfs if we install an eBPF program that generates a SIGKILL early in kernel entry: the write would not be issued, as the pending signal would be noticed and the operation aborted.

Btrfs doesn't behave like other filesystems, however. It doesn't check for signals before issuing a write or read further down the IO stack. When a SIGKILL is received, it completes the IO operation before terminating.

We cannot prevent Btrfs from writing by generating a SIGKILL from an eBPF program when the process enters the write system call. Assuming this is what we would like to do, it’s logical to consider generating the SIGKILL earlier, on openat(2): this way we terminate the program much earlier, before it even has a chance to issue a write. Unfortunately, this is also unreliable, as demonstrated in the next section.

Racing open & write operations

If we generate the SIGKILL in openat(2), it is still possible to write to a file descriptor that would be returned, at least with Btrfs. The following bpftrace line will install a tiny eBPF program on vfs_open() that will generate a SIGKILL and terminate any process trying to open the file named __nowrite.

bpftrace --unsafe -e 'kprobe:vfs_open /str(((struct path *)arg0)->dentry->d_name.name) == "__nowrite"/ 
 { signal("SIGKILL") }'

It's still possible to race the kernel and write to the would-be file descriptor, meaning we can't rely on this mechanism to prevent the file from being modified even if we can terminate the process.

It should be clear by now that the open operation happens, as discussed at the beginning of this article: a file can be created with the O_CREAT flag, and the effects that occur between the open operation and process termination are observable. The important observable effect is that the process file table is populated just before the process terminates.

The process file table is a per-process in-kernel table that maps file descriptor numbers to file objects. This is where, for example, file descriptor 1 refers to a file object representing standard output: if userland calls write(1, "foo", strlen("foo")), the kernel will look up the object referenced by file descriptor 1 and call vfs_write() on it. The file structure has callbacks that know how to write to standard output; we say this is the backing of the file descriptor.
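As a small illustrative example, the following program exercises exactly that path: the kernel resolves descriptor 1 through the process file table to the file object backing standard output.

#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *msg = "foo\n";

	/* fd 1 -> file object for stdout -> vfs_write() */
	write(1, msg, strlen(msg));

	return (0);
}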

The general idea is to guess the file descriptor number that would be returned by an open operation and attempt to write to it before the process is terminated but after the open operation takes effect.

The first trick is figuring out what the file descriptor number would be; this can be done with:

int guessed_fd;

guessed_fd = dup(0);
close(guessed_fd);

When a file descriptor is created via dup(2), open(2), accept(2), socket(2), or any other system call, it is guaranteed to use the lowest available number. If we dup any file descriptor and close it, the next file-descriptor-creating system call will likely end up using the same index we got from dup(2) earlier. This isn’t necessarily true for multithreaded programs, as another thread might create a file descriptor and invalidate our guess. This is where dup2(2) helps: by naming the target descriptor explicitly, it gives multithreaded programs a race-free dup. Multithreading was a late addition to UNIX systems, so the old lowest-available-number semantics had to be preserved.
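For completeness, a hypothetical fragment of the race-free variant, in the style of the program shown later (err() comes from err.h):

/* Make descriptor 99 refer to the same file as fd, atomically */
if (dup2(fd, 99) == -1)
	err(1, "dup2");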

This guessing is not strictly necessary here, since we have a controlled environment. However, it is interesting because it could serve as a building block for an attack exploiting this race condition.

Now that we have a target file descriptor, we can spawn a bunch of worker threads attempting to write to it!

/*
 * Guess the next file descriptor open will get
 */
if ((fd = dup(0)) == -1)
	err(1, "dup");
close(fd);

/*
 * Hammer Time, spawn a bunch of threads to write at the guessed fd,
 * they hammer it even before we open.
 */
while (num_workers--)
	if (pthread_create(&t_writer, NULL, writer, &fd) != 0)
		err(1, "pthread_create");

/* Give the workers some lead time */
msleep(10);

/*
 * This should never return, since we are supposed to be SIGKILLed.
 * The race depends on the workers hitting the file descriptor after
 * open(2) succeeded (after fd_install()) but before
 * exit_to_user_mode()->do_group_exit().
 */
fd = open(path, O_RDWR|O_CREAT, 0660);
errx(1, "not killed, open returned fd %d", fd);

The writer-worker code is as simple as you could expect:

void *
writer(void *vpfd)
{
	ssize_t n;
	int fd = *(int *)vpfd;

	/*
	 * We'll just hammer-write the guessed file descriptor, if we succeed
	 * we just bail as the parent thread is about to do it anyway.
	 */
	while (1) {
		n = write(fd, SECRET, strlen(SECRET));
		/* We expect to get EBADF mostly */
		if (n <= 0) {
			continue;
		}
		/* Hooray, the file has been written */
		break;
	}

	return (NULL);
}

The complete program is available here.

Most of the time we can't trigger the race condition and the program terminates with SIGKILL. By running the program in a loop, though, we can hit the race in about a minute.

truncate -s0 __nowrite
until test -s __nowrite; do ./race-openwrite __nowrite; done

It's worth pointing out that this behavior is not a kernel bug in any way and is only reproducible in Btrfs. We've failed to trigger this race condition in other filesystems like ext4, tmpfs, and xfs as these implementations explicitly check for a fatal signal pending before proceeding with the write.

Other Effects

We’ve talked about open and write, but we also checked whether generating a SIGKILL blocks the effects of other system calls. In the table below, BLOCKED means the effect did not occur – for example, unlink did not remove the file. As you can guess, UNBLOCKED means the effect did occur – unlink did remove the file. In both cases the program is always SIGKILLed, meaning our signal generation did occur.

6.5.5-200.fc38.x86_64            Btrfs        tmpfs        Ext4
chmod(2)                         UNBLOCKED    UNBLOCKED    UNBLOCKED
link(2)                          UNBLOCKED    UNBLOCKED    UNBLOCKED
mknod(2)                         UNBLOCKED    UNBLOCKED    UNBLOCKED
write(2)                         UNBLOCKED    BLOCKED      BLOCKED
race-open-write                  UNBLOCKED    BLOCKED      BLOCKED
rename(2)                        UNBLOCKED    UNBLOCKED    UNBLOCKED
truncate(2)                      UNBLOCKED    UNBLOCKED    UNBLOCKED
unlink(2)                        UNBLOCKED    UNBLOCKED    UNBLOCKED

6.1.55-75.123.amzn2023.aarch64   XFS
chmod(2)                         UNBLOCKED
link(2)                          UNBLOCKED
mknod(2)                         UNBLOCKED
write(2)                         BLOCKED
race-open-write                  BLOCKED
rename(2)                        UNBLOCKED
truncate(2)                      UNBLOCKED
unlink(2)                        UNBLOCKED

Instruction                      6.5.5-200.fc38.x86_64   6.1.55-75.123.amzn2023.aarch64
write(2) on a pipe(2)            UNBLOCKED               UNBLOCKED
fork(2)                          BLOCKED                 BLOCKED

The same behavior is observed for all the equivalent “at” system calls: openat(2), renameat(2)...

Conclusion

We’ve demonstrated some of the pitfalls of attempting to use SIGKILL as a security mechanism from eBPF. While there are cases where it can be used reliably, those are delicate and require a deep understanding of the environment in which they run. The key takeaways from this article are:

  • Signal generation from within eBPF is synchronous since it’s generated to-and-from the same process context
  • Signals are processed in the kernel after the system call takes place
  • Specific system calls and combinations will avoid starting an operation if a fatal signal is pending
  • We can’t reliably prevent a write(2) on Btrfs, even if we kill the program before open(2) returns from the kernel

While our research is thorough, these are delicate semantics that might depend on external factors. If you believe we’ve missed something please do not hesitate to contact us.

If you’re interested in seeing more, the programs and scripts used in this research are public and available in this repository. Interested in learning more about the kernel? Check out this deep dive on call-stacks.