The Linux process model, available within Elastic, allows users to write very targeted alerting rules and gain deeper insight into exactly what is happening on their Linux servers and desktops.
In this blog, we will provide background on the Linux process model, a key aspect of how Linux workloads are represented.
Linux follows the Unix process model of the 1970s, which was augmented in the 1980s with the concept of sessions, judging by when the setsid() system call appears in early POSIX documents.
The Linux process model is a good abstraction for recording computer workloads (which programs are run) and for writing rules to react to these events. It offers a clear representation of who did what when on which server for alerting, compliance and threat hunting.
Capturing process creation, privilege escalation and lifespans offers deep insight into how applications and services are implemented and their normal patterns of program execution. Once normal execution patterns are identified, rules may be written to send alerts when anomalous execution patterns occur.
Detailed process information permits very targeted rules to be written for alerts, which reduces false positives and alert fatigue. It also allows Linux sessions to be categorized as one of:
- autonomous services started at boot (e.g. cron)
- services providing remote access (e.g. sshd)
- interactive (likely human) remote access (e.g., a bash terminal started via ssh)
- non-interactive remote access (e.g. Ansible installing software via ssh)
These categorizations permit very precise rules and review. For example, one could review all interactive sessions on specific servers in a selected timeframe.
This article describes how the Linux process model works and will assist in writing alerting and response rules for workload events. An understanding of the Linux process model is also an essential first step to understanding Containers and the namespaces and cgroups from which they are composed.
Process model capture vs. system call logs
Capturing changes to the session model in terms of new processes, new sessions, exiting processes, etc. is simpler and clearer than capturing the system calls used to enact those changes. Linux has approximately 400 system calls and does not refactor them once they are released. This approach retains a stable application binary interface (ABI), which means programs compiled to run on Linux years ago should continue to run on Linux today without rebuilding them from source code.
New system calls are added to improve capabilities or security instead of refactoring existing system calls (avoiding breaking the ABI). The upshot is that mapping a time-ordered list of system calls and their parameters to the logical actions they perform takes a significant amount of expertise. Additionally, newer system calls, such as those of io_uring, make it possible to read and write files and sockets with no additional system calls by using memory mapped between kernel and user space.
By contrast, the process model is stable (hasn't changed much since the 1970's) yet still comprehensively covers the actions taken on a system when one includes file access, networking and other logical operations.
Process formation: init is the first process after boot
When the Linux kernel has started, it creates a special process called “the init process.” A process embodies the execution of one or more programs. The init process always has the process id (PID) of 1 and is executed with a user id of 0 (root). Most modern Linux distributions use systemd as their init process's executable program.
The job of init is to start the configured services such as databases, web servers, and remote access services such as sshd. These services are typically encapsulated within their own sessions, which simplifies starting and stopping services by grouping all processes of each service under a single session id (SID).
Remote access, such as via the SSH protocol to an sshd service, will create a new Linux session for the accessing user. This session will initially execute the program the remote user requested — often an interactive shell — and the associated process(es) will all have the same SID.
The mechanics of creating a process
Every process, except the init process, has a single parent process. Each process has a PPID, the process id of its parent process (0/no-parent in the case of init). Reparenting can occur if a parent process exits in a way that does not also terminate the child process(es).
Reparenting usually picks init as the new parent and init has special code to clean up after these adopted children when they exit. Without this adoption and clean up code, orphaned child processes would become "zombie" processes (no kidding!). They hang around until their parent reaps them so the parent can examine their exit code — an indicator of whether the child program completed its tasks successfully.
The advent of "containers," pid namespaces in particular, necessitated the ability to designate processes other than init as "sub-reapers" (processes willing to adopt orphaned processes). Typically sub-reapers are the first process in a container. This is done because the processes in the container cannot "see" processes in the ancestor pid namespaces (i.e. their PPID value would not make sense if the parent was in an ancestor pid namespace).
To create a child process, the parent clones itself via the fork() or clone() system call. After the fork/clone, execution immediately continues in both the parent and the child (ignoring vfork() and clone()’s CLONE_VFORK option), but along different code paths by virtue of the return code value from fork()/clone().
You read that correctly: one fork()/clone() system call provides a return code in two different processes! The parent receives the PID of the child as its return code, and the child receives 0 so the shared code of the parent and child can branch based on that value. There are some cloning nuances with multi-threaded parents and copy-on-write memory for efficiency that do not need to be elaborated on here. The child process inherits the memory state of the parent and its open files, network sockets, and the controlling terminal, if any.
Typically, the parent process will capture the PID of the child to monitor its lifecycle (see reaping above). The child process's behavior depends on the program that cloned itself (it provides an execution path to follow based on the return code from fork()).
A web server such as nginx might clone itself, creating a child process to handle http connections. In cases like this, the child process does not execute a new program; it simply runs a different code path in the same program — here, the one that handles http connections. Recall that the return value from a clone or fork tells the child that it is the child so it can choose this code path.
Interactive shell processes (e.g., one of bash, sh, fish, zsh, etc. with a controlling terminal), possibly from an ssh session, clone themselves whenever a command is entered. The child process, still running a code path from the parent/shell, does a bunch of work setting up file descriptors for IO redirection, setting the process group, and more before the code path in the child calls the execve() system call or similar to run a different program inside that process.
If you type ls into your shell, it forks your shell, the setup described above is done by the shell/child and then the ls program (usually from the /usr/bin/ls file) is executed to replace the contents of that process with the machine code for ls. This article about implementing shell job control provides great insight into the inner workings of shells and process groups.
It is important to note that a process can call execve() more than once, and therefore workload capture data models must handle this as well. This means that a process can become many different programs before it exits — not just the program of its parent optionally followed by one other program. See the shell exec builtin command for a way to do this in a shell (i.e. replace the shell program with another in the same process).
Another aspect of executing a program in a process is that some open file descriptors (those marked as close-on-exec) may be closed prior to the execution of the new program, while others may remain available to the new program. Recall that a single fork()/clone() call provides a return code in two processes, the parent and the child. The execve() system call is strange as well in that a successful execve() has no return code for success because it results in a new program execution so there's nowhere to return to except when execve() fails.
Creating new sessions
Linux currently creates new sessions with a single system call, setsid(), which is called by the process that becomes the new session leader. This system call is often part of the cloned child’s code path run before executing another program in that process (i.e. it’s planned by, and included in, the parent process’s code). All processes within a session share the same SID, which is the same as the PID of the process that called setsid(), also known as the session leader. In other words, a session leader is any process with a PID that matches its SID. When the session leader exits, the processes in the foreground process group of its controlling terminal, if any, are sent a SIGHUP signal, which terminates them by default.
Creating new process groups
Linux uses process groups to identify a group of processes working together within a session. They will all have the same SID and process group id (PGID). The PGID is the PID of the process group leader. There is no special status for the process group leader; it may exit with no effect on other members of the process group and they retain the same PGID — even though the process with that PID no longer exists.
Note that even with pid-wrap (re-use of a recently used pid on busy systems), the Linux kernel ensures the pid of an exited process group leader is not reused until all members of that process group have exited (i.e. there is no way their PGID could accidentally refer to a new process).
Process groups are valuable for shell pipeline commands like:
cat foo.txt | grep bar | wc -l
This creates three processes for three different programs (cat, grep and wc) and connects them with pipes. Shells will create a new process group even for single program commands like ls. The purpose of process groups is to permit targeting of signals to a set of processes and to identify a set of processes — the foreground process group — that are permitted full read and write access to their session’s controlling terminal, if any.
In other words, control-C in your shell will send an interrupt signal to all processes in the foreground process group (the negative PGID value as the signal’s pid target discriminates between the group versus the process group leader process itself). The controlling terminal association ensures that processes reading input from the terminal don’t compete with each other and cause issues (terminal output may be permitted from non-foreground process groups).
Users and groups
As mentioned above, the init process has the user id 0 (root). Every process has an associated user and group and these may be used to restrict access to system calls and files. Users and groups have numeric ids and may have an associated name like root or ms. The root user is the superuser, which may do anything; for security reasons it should be used only when absolutely required.
The Linux kernel only cares about ids. Names are optional and provided for human convenience by the files /etc/passwd and /etc/group. The Name Service Switch (NSS) allows these files to be extended with users and groups from LDAP and other directories (use getent passwd if you want to see the combination of /etc/passwd and NSS-provided users).
Each process may have several users and groups associated with it (real, effective, saved, and supplemental groups). See man 7 credentials for more information.
The increased use of containers, whose root file systems are defined by container images, has increased the likelihood that /etc/passwd and /etc/group are absent or are missing names for some user and group ids in use. Since the Linux kernel cares only about the ids, not the names, this is fine.
The Linux process model provides a precise and succinct way of representing server workloads, which in turn allows very targeted alerting rules and review. An easy-to-understand per-session rendering of the process model in your browser would provide great insight into your server workloads.
Linux man pages are an excellent source of information. The man pages below have details of the Linux process model described above: