
Seccomp in the Elastic Stack

After giving a presentation about what is done in Elasticsearch to improve out-of-the-box security, safety and usability and engaging in a couple of follow-up discussions at different events, I decided to dig a little bit deeper into the topic of Linux’s seccomp.

What is seccomp?

The idea of seccomp is to prevent the execution of certain system calls by a given application.

Why is this kind of security feature needed? Imagine someone finds a way to execute code within your application, which means they act with the same user rights the application was started with. A common example is a web application in which attackers find a way to execute arbitrary commands, often due to insufficient validation of input arguments. These flaws are categorized as remote code execution (RCE) vulnerabilities. RCE attacks are often used as a gateway to install further malicious applications such as rootkits, bind shells, or bitcoin miners.

Now imagine your web application could tell the kernel upon startup: if certain system calls are about to be executed (such as a call to execute or fork another command or binary), then the kernel should abort this call or even kill the process — thus preventing any malicious code from being executed.

This is seccomp in a nutshell.

Seccomp is an abbreviation for “secure computing” mode. The basic idea is that the application knows best what system calls it is supposed to execute — usually only a small subset from the hundreds of existing ones. If a certain system call is not on your list of allowed ones, why not reject it? Seccomp allows you to build an application sandbox.

The basic idea for seccomp was already added in Linux 2.6.12 in 2005, where an application could write 1 to /proc/$PID/seccomp to enter a strict mode where only the read(), write(), exit(), and sigreturn() calls were allowed from that moment onwards. In Linux 3.5 in 2012, the foundation was laid for the ability to control which system calls are permitted. From Linux 3.17 onwards, a dedicated system call named seccomp() is available for easier configuration.

If you’re asking yourself whether this functionality is actually used, the answer is: most definitely! If you are reading this blog post via Google Chrome, then your renderer is using seccomp. The same is true for Flash content in Chrome. Other applications like Firefox, OpenSSH, Docker, QEMU, systemd, Android, or Firecracker are also leveraging seccomp.

Registering a seccomp filter

In order to register a seccomp filter, a filter must be installed with the seccomp() system call. The filter is written as a Berkeley Packet Filter (BPF) program. If you have ever experienced the delight of debugging a networking issue, you may already have used BPF as part of an expression in tcpdump. Once a BPF filter is registered, every system call of that application triggers the execution of the filter. The advantage of using BPF is that filtering is done in kernel space. As no data needs to be passed to userspace, the process of filtering is very efficient.

BPF has some handy properties like dead code detection, but the guarantee that every program terminates is particularly good to have. BPF can make this guarantee because there are no instructions to jump backwards, so program execution resembles a directed acyclic graph. These BPF programs may not be the most human-readable, but we’ll explore what can be done about this later on in this blog post.

Note: in some cases the filter is also referred to as a policy.

When the seccomp filter is run, it decides the outcome of the system call. Possible outcomes include:

  1. The system call can be allowed
  2. The process or the thread can be killed
  3. An error is returned to the caller in addition to logging
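
To make these outcomes a bit more concrete: a filter reports its decision as a single 32-bit return value, where the upper bits select the action and the lower 16 bits carry optional data such as an errno. The following is a minimal Java sketch of that encoding. The constant values are, to the best of my knowledge, the ones from the kernel’s linux/seccomp.h header; the class and method names (SeccompReturn, errno, describe) are made up for this illustration and are not part of Elasticsearch or any library.

```java
// Constant values follow <linux/seccomp.h>; class and method names are illustrative only.
final class SeccompReturn {
    static final int SECCOMP_RET_KILL_PROCESS = 0x80000000; // kill the whole process
    static final int SECCOMP_RET_KILL_THREAD  = 0x00000000; // kill the calling thread
    static final int SECCOMP_RET_ERRNO        = 0x00050000; // fail the call with an errno
    static final int SECCOMP_RET_ALLOW        = 0x7fff0000; // let the call through
    static final int SECCOMP_RET_ACTION_FULL  = 0xffff0000; // mask for the action bits
    static final int SECCOMP_RET_DATA         = 0x0000ffff; // mask for the data bits
    static final int EACCES = 13;

    // Combine the ERRNO action with the error code the caller should see.
    static int errno(int code) {
        return SECCOMP_RET_ERRNO | (code & SECCOMP_RET_DATA);
    }

    // Split a filter return value back into action and data.
    static String describe(int ret) {
        int action = ret & SECCOMP_RET_ACTION_FULL;
        int data = ret & SECCOMP_RET_DATA;
        if (action == SECCOMP_RET_ALLOW) return "ALLOW";
        if (action == SECCOMP_RET_ERRNO) return "ERRNO(" + data + ")";
        if (action == SECCOMP_RET_KILL_THREAD) return "KILL_THREAD";
        if (action == SECCOMP_RET_KILL_PROCESS) return "KILL_PROCESS";
        return "OTHER";
    }

    public static void main(String[] args) {
        System.out.println(describe(SECCOMP_RET_ALLOW));
        System.out.println(describe(errno(EACCES)));
        System.out.println(describe(SECCOMP_RET_KILL_PROCESS));
    }
}
```

The kernel defines a few more actions than the three outcomes listed above; this sketch deliberately lumps them into "OTHER".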

There is a helper library called libseccomp that eases seccomp configuration a bit. You can use it in your own applications instead of opting for the more low-level way. Elasticsearch takes the low-level route in order to avoid depending on additional libraries.

Elasticsearch and seccomp

As all of the above is pretty low level, how does this integrate into Elasticsearch?

Looking at the code, there is a call to Bootstrap.initializeNatives() early in the startup process, which in turn calls Natives.tryInstallSystemCallFilter(). The latter method first checks if Java Native Access (JNA) is available, and only then proceeds to JNANatives.tryInstallSystemCallFilter() in an attempt to execute SystemCallFilter.init() successfully. It will log an error if this call fails.

A little background on what JNA is: JNA allows access to native shared libraries to invoke native code. This is exactly what is needed to set up seccomp! Let’s take a closer look at the SystemCallFilter.init() method, as it reveals an interesting piece of information:

    static int init(Path tmpFile) throws Exception {
        if (Constants.LINUX) {
            return linuxImpl();
        } else if (Constants.MAC_OS_X) {
            // try to enable both mechanisms if possible
            bsdImpl();
            macImpl(tmpFile);
            return 1;
        } else if (Constants.SUN_OS) {
            solarisImpl();
            return 1;
        } else if (Constants.FREE_BSD || OPENBSD) {
            bsdImpl();
            return 1;
        } else if (Constants.WINDOWS) {
            windowsImpl();
            return 1;
        } else {
            throw new UnsupportedOperationException("syscall filtering not supported for OS: '" + Constants.OS_NAME + "'");
        }
    }

This call does something that we have not yet mentioned. It tries to install the system call filter not only for Linux, but also for macOS, BSD, Solaris, and Windows. This kind of functionality exists not only under Linux, but also under all the other Elasticsearch-supported operating systems (and more!). The details and the nomenclature differ, as every operating system has its own mechanism and syscall names (it is only called seccomp under Linux), but we can still achieve similar functionality. For the sake of simplicity, we will stick with Linux for the rest of this blog post, as there are indeed implementation differences.

Now let us take a look at SystemCallFilter.linuxImpl(). There are two ways to install a seccomp filter: using the seccomp() system call or, on older Linux kernels, the prctl() system call. The code tries both to make sure installation works. The interesting bit, however, is the filter that actually does get installed. Let’s take a look at that:

SockFilter insns[] = {
          /* 1  */ BPF_STMT(BPF_LD  + BPF_W   + BPF_ABS, SECCOMP_DATA_ARCH_OFFSET),

          // if (arch != audit) goto fail;
          /* 2  */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,   arch.audit,     0, 7),
          /* 3  */ BPF_STMT(BPF_LD  + BPF_W   + BPF_ABS, SECCOMP_DATA_NR_OFFSET),   

          // if (syscall > LIMIT) goto fail;
          /* 4  */ BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K,   arch.limit,     5, 0),

          // if (syscall == FORK) goto fail;
          /* 5  */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,   arch.fork,      4, 0),

          // if (syscall == VFORK) goto fail;
          /* 6  */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,   arch.vfork,     3, 0),

          // if (syscall == EXECVE) goto fail;
          /* 7  */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,   arch.execve,    2, 0),

          // if (syscall == EXECVEAT) goto fail;
          /* 8  */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,   arch.execveat,  1, 0),

          // pass: return OK;
          /* 9  */ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),

          // fail: return EACCES;
          /* 10 */ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
        };

The formatting in the source is slightly different; it has been changed for readability. The code comments are super nice and make this easier to understand. The first checks verify the architecture and the syscall limit to ensure consistency, as otherwise the IDs of the syscalls might be different and thus the wrong syscalls would get blocked. The meat of the checks is really about the FORK, VFORK, EXECVE, and EXECVEAT system calls; blocking them prevents starting or forking new processes, as in the web application example mentioned above.

Why those four system calls? These are the calls used to fork a process or execute another binary. So by preventing these system calls, no external processes can be started from within your application.
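
If the jump offsets in the filter feel a bit cryptic, it can help to step through them by hand. The following Java sketch emulates the ten instructions above in plain userspace code: each jump adds its "true" or "false" offset to the position of the next instruction, so execution only ever moves forward. Note the assumptions: the numbers are the x86-64 values as I understand them (audit architecture 0xC000003E, fork 57, vfork 58, execve 59, execveat 322, and a limit of 0x3FFFFFFF), and the names FilterWalkthrough, Op, Insn, and run are hypothetical. The real filter runs inside the kernel, not in Java.

```java
// Userspace emulation of the filter shown above; all names are illustrative only.
final class FilterWalkthrough {
    enum Op { LD_ARCH, LD_NR, JEQ, JGT, RET }

    static final class Insn {
        final Op op;
        final int k, jt, jf; // operand, jump-if-true offset, jump-if-false offset
        Insn(Op op, int k, int jt, int jf) { this.op = op; this.k = k; this.jt = jt; this.jf = jf; }
    }

    static final int ALLOW = 0x7fff0000;             // SECCOMP_RET_ALLOW
    static final int ERRNO_EACCES = 0x00050000 | 13; // SECCOMP_RET_ERRNO | EACCES

    // Assumed x86-64 values: arch, limit, fork, vfork, execve, execveat.
    static final Insn[] PROGRAM = {
        new Insn(Op.LD_ARCH, 0,          0, 0), // 1: load architecture
        new Insn(Op.JEQ,     0xC000003E, 0, 7), // 2: if (arch != audit) goto fail
        new Insn(Op.LD_NR,   0,          0, 0), // 3: load syscall number
        new Insn(Op.JGT,     0x3FFFFFFF, 5, 0), // 4: if (syscall > LIMIT) goto fail
        new Insn(Op.JEQ,     57,         4, 0), // 5: fork
        new Insn(Op.JEQ,     58,         3, 0), // 6: vfork
        new Insn(Op.JEQ,     59,         2, 0), // 7: execve
        new Insn(Op.JEQ,     322,        1, 0), // 8: execveat
        new Insn(Op.RET,     ALLOW,        0, 0), // 9: pass
        new Insn(Op.RET,     ERRNO_EACCES, 0, 0), // 10: fail
    };

    static int run(int arch, int syscallNr) {
        int acc = 0; // the BPF accumulator
        int pc = 0;  // index of the next instruction
        while (true) {
            Insn insn = PROGRAM[pc];
            pc++; // jump offsets are relative to the following instruction
            switch (insn.op) {
                case LD_ARCH: acc = arch; break;
                case LD_NR:   acc = syscallNr; break;
                case JEQ:     pc += (acc == insn.k) ? insn.jt : insn.jf; break;
                case JGT:     pc += (Integer.compareUnsigned(acc, insn.k) > 0) ? insn.jt : insn.jf; break;
                case RET:     return insn.k;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("bind (49):   0x" + Integer.toHexString(run(0xC000003E, 49)));
        System.out.println("execve (59): 0x" + Integer.toHexString(run(0xC000003E, 59)));
    }
}
```

Since offsets are never negative, the loop visits each instruction at most once, which is the termination guarantee mentioned earlier.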

After the filters are installed using either the prctl or the seccomp syscall, there is one final check that they have been installed properly, and then business goes on as usual. The whole SystemCallFilter class with support for all the operating systems has about 600 lines of code: a bit of code with a huge security impact.

As you can see, this implementation uses the seccomp statements in pretty much the same way as if they were written in C. Thanks to JNA, this makes it as readable as a C variant, only it’s within Java.

Beats have taken a slightly different approach by allowing you to configure the list of allowed or rejected system calls within their respective YAML configuration. In the next section, we’ll dive into how Beats and seccomp work together.

Beats and seccomp

Beats is another user of seccomp. All Elastic Beats are using it and enable seccomp by default. To simplify configuration, a go-seccomp-bpf library was written. The main goal of this library is to write seccomp policies in YAML as follows:

seccomp:
  default_action: allow

  syscalls:
  # Network sandbox example.
  - action: errno
    names:
    - connect
    - accept
    - sendto
    - recvfrom
    - sendmsg
    - recvmsg
    - bind
    - listen

This allows all system calls with the exception of networking ones. This Beat would not be able to do anything network specific, which would be hard given the fact that you need to send the Beat data somewhere.
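
The decision logic behind such a policy is simple: look up the system call by name, and fall back to the default action if no rule mentions it. Here is a small Java model of just that lookup. It is only a sketch for intuition: the real go-seccomp-bpf library is written in Go and compiles the policy into a BPF program, and the names PolicySketch, rule, and decide are invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a default action plus per-syscall overrides; not the library's API.
final class PolicySketch {
    private final String defaultAction;
    private final Map<String, String> actionBySyscall = new HashMap<>();

    PolicySketch(String defaultAction) {
        this.defaultAction = defaultAction;
    }

    // Register one "action + names" entry from the syscalls list.
    PolicySketch rule(String action, List<String> names) {
        for (String name : names) {
            actionBySyscall.putIfAbsent(name, action); // the first matching rule wins in this model
        }
        return this;
    }

    // Return the action for a system call, or the default if no rule names it.
    String decide(String syscall) {
        return actionBySyscall.getOrDefault(syscall, defaultAction);
    }

    public static void main(String[] args) {
        PolicySketch policy = new PolicySketch("allow")
            .rule("errno", Arrays.asList("connect", "accept", "sendto", "recvfrom",
                                         "sendmsg", "recvmsg", "bind", "listen"));
        System.out.println("bind -> " + policy.decide("bind"));
        System.out.println("read -> " + policy.decide("read"));
    }
}
```

Swapping the default action to errno and listing only the permitted calls under allow turns the same structure into an allow-list policy.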

By default, the behavior of Beats is similar to Elasticsearch in that it prevents processes from being executed or forked along with some more system calls being blocked. The implementation on the Beats side goes a step further and uses a whitelist rather than a blacklist. This comes at the expense of having a higher maintenance requirement.

Using Auditbeat to monitor seccomp violations

Auditbeat is actually able to log all seccomp violations. The event.action value will be violated-seccomp-policy and the event will contain information about the process and system call. You can also filter for any seccomp event by checking for event.type:seccomp. This allows you to monitor seccomp violations over time and adapt your own policies as applications change.

Let’s take a look at how to index those events into a cloud cluster.

Generate seccomp events with firejail

In order to intentionally generate seccomp events, spin up a Linux machine, download Auditbeat, and install a small tool named firejail. First, let’s try to bind to a port using netcat:

$ nc -v -l 8000
Listening on [0.0.0.0] (family 0, port 8000)

Any user on a Linux system can bind to ports above 1024. You can now send some data to this port or just hit Ctrl+C to quit the process. Alternatively, you can also start netcat with strace and run strace -e bind nc -v -l 8000 to see the bind() system call that assigns an address to a socket.

For demonstration purposes we will now use firejail, which uses seccomp, to prevent calling bind().

$ firejail --noprofile --seccomp.drop=bind -c nc -v -l 8000

Calling this gives us a pretty immediate return to the prompt, but it is hard to figure out what has happened. So let’s use strace again:

$ firejail --noprofile --seccomp.drop=bind -c strace nc -v -l 8000
... [lots of strace output ]
bind(3, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("0.0.0.0")}, 16) = ?
+++ killed by SIGSYS (core dumped) +++

As you can see, the process has been killed. Also, running dmesg will show an audit entry:

[ 8085.803291] audit: type=1326 audit(1561549361.115:43): auid=1000 uid=1000 gid=1000 ses=7 pid=7018 comm="nc" exe="/bin/nc.openbsd" sig=31 arch=c000003e syscall=49 compat=0 ip=0x7fc2f5359877 code=0x0

In case you’re wondering about the binary name: /bin/nc is a symlink under Ubuntu.
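
The audit entry above is a flat list of key=value pairs, so it is easy to pick apart yourself if you want to understand what Auditbeat has to work with. The following Java sketch parses the line and translates a few fields. The class name AuditLine and the tiny syscall table are assumptions made for this example (the numbers are x86-64 values), and this is not how Auditbeat itself is implemented.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for a kernel seccomp audit line; names are hypothetical.
final class AuditLine {
    // key=value, where the value is either a quoted string or a run of non-space characters
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\"[^\"]*\"|\\S+)");

    // A handful of x86-64 syscall numbers, just enough for this example.
    private static final Map<Integer, String> X86_64_SYSCALLS = new HashMap<>();
    static {
        X86_64_SYSCALLS.put(49, "bind");
        X86_64_SYSCALLS.put(57, "fork");
        X86_64_SYSCALLS.put(58, "vfork");
        X86_64_SYSCALLS.put(59, "execve");
    }

    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String value = m.group(2);
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip surrounding quotes
            }
            fields.put(m.group(1), value);
        }
        return fields;
    }

    static String syscallName(Map<String, String> fields) {
        int nr = Integer.parseInt(fields.get("syscall"));
        return X86_64_SYSCALLS.getOrDefault(nr, "unknown(" + nr + ")");
    }

    public static void main(String[] args) {
        String line = "[ 8085.803291] audit: type=1326 audit(1561549361.115:43): auid=1000 uid=1000 "
            + "gid=1000 ses=7 pid=7018 comm=\"nc\" exe=\"/bin/nc.openbsd\" sig=31 arch=c000003e "
            + "syscall=49 compat=0 ip=0x7fc2f5359877 code=0x0";
        Map<String, String> fields = parse(line);
        System.out.println(fields.get("comm") + " was stopped with signal " + fields.get("sig")
            + " while calling " + syscallName(fields));
    }
}
```

Here type=1326 is the audit record type for seccomp events, sig=31 is SIGSYS, and syscall=49 is bind() on x86-64, which matches the strace output.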

OK, so we’re now able to create seccomp security events. Let’s put those into a cloud cluster.

Configure Auditbeat by adding cloud.id and cloud.auth parameters. Once that’s complete, restart Auditbeat and make sure you’re running auditbeat setup --dashboards to import all the Auditbeat dashboards.

Now re-execute the above firejail call a couple of times. Next up, search for seccomp violation events in your cluster:

GET auditbeat-*/_search
{
  "size": 1, 
  "query": {
    "match": {
      "event.action": "violated-seccomp-policy"
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}

The above query will show you the latest seccomp violation. Alternatively, you could also use the new Elastic SIEM app for further analysis. If you haven't used Elastic SIEM yet, definitely check it out. It's a free, Basic feature that's available in Elasticsearch Service, as well as the default distribution of the Elastic Stack.


Wrapping up

And there you have it: running seccomp in the Elastic Stack. I hope you found this helpful in further understanding how seccomp works in principle and how it’s used in the Elastic Stack. Perhaps you even collected a few ideas on how to integrate this in your own app! Please remember to always keep operating system features in mind instead of trying to reimplement some functionality yourself, especially if it is well-tested functionality.

There are a couple of interesting BPF-based tools, for example for converting firewall rules to BPF or for monitoring container traffic, that would be worth your time to explore.

If you have further questions, don’t hesitate to check out our forum and ask all the things.