Article

Faster fuzzing with Python

For a good fuzzer it is important to maximise the cycles that exercise the target code and reduce any other overheads where possible. This is especially the case when using a non-native programming language like Python, as these tend to be slower to run than their native alternatives.

One of the most common methods for executing external processes in Python is the subprocess module. A quick and dirty example mimicking a fuzzer executing a small executable for a given number of iterations could look something like this:

import subprocess

devnull = open("/dev/null", "w")
def run():
        ps = subprocess.Popen(["/bin/gzip", "-h"], shell=False, stdout=devnull)
        ps.wait()

if __name__ == "__main__":
        for x in range(0,1000):
                run()

We can time the execution of the script to see how it performs:

nils@box:~/bench$ time python python.py

real    0m2.466s
user    0m0.452s
sys     0m2.424s

405 executions per second is not too bad, but let’s see how this compares to the equivalent functionality written in C:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

extern char **environ;

void run(int devnull) {
        int pid = fork();
        int status = 0;
        if (pid == 0) {
                dup2(devnull, 1);
                execl("/bin/gzip", "/bin/gzip", "-h", (char*)0);
        } else {
                waitpid(pid, &status, 0);
        }
}

int main() {
        int x=0;
        int devnull = open("/dev/null", O_WRONLY);
        for(x=0;x<1000;x++) run(devnull);
        return 0;
}

Again we can time the execution:

nils@box:~/bench$ gcc -o test1 test1.c
nils@box:~/bench$ time ./test1

real    0m1.115s
user    0m0.072s
sys     0m1.048s

897 executions per second, which is more than twice as fast as our Python example. Does this mean we should abandon Python for fuzzing and implement our fuzzers in plain C? Or is there any way of getting closer to the performance of native C programs with Python?

Let’s start with building up hypotheses on why Python might be slower in executing testcases than C. Three ideas come to mind:

  1. The overhead of starting the Python interpreter before running our tests leads to the slow down
  1. The subprocess module is slow and we lose valuable cycles in the Python module itself
  1. The subprocess module and our example both use fork(), which effectively duplicates the memory mappings of the parent process. Our Python interpreter is likely to have a more complex memory layout than our minimal C code, which leads to the slowdown during fork()

We can immediately disregard thesis #1 by increasing the number of iterations to 10,000:

nils@box:~/bench$ time python python.py

real    0m24.361s
user    0m3.893s
sys     0m24.462s

The executions per second increase, but only marginally. If starting Python was the overhead, we would expect the increase in iterations to have much less of an effect on the total execution time, so this is not the source of our performance problems.

Let’s look at thesis #2. In order to see whether the subprocess module itself is actually responsible for the slow down, we can get rid of it altogether and implement our own fork/exec code:

import subprocess
import os

devnull = open("/dev/null", "w")
def run():
        pid = os.fork()
        if pid == 0:
                args = ["/bin/gzip", "-h"]
                os.dup2(devnull.fileno(), 1)
                os.execvp("/bin/gzip", args)
        else:
                os.waitpid(pid, 0)

if __name__ == "__main__":
        for x in range(0,1000):
                run()

And the result:

nils@box:~/bench$ time python test2.py

real    0m2.203s
user    0m0.271s
sys     0m2.061s

453 executions per second, which is a notable improvement over the subprocess module, however it’s still significantly slower than the performance of the native C code.

This leaves us with our last thesis, that the more complex memory layout of the Python interpreter takes more time to fork. We can test this thesis by making the memory layout of the C program a little more complex. One way to do this is to allocate more memory in the parent process prior to the fork, which is shown below:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

extern char **environ;

void run(int devnull) {
        int pid = fork();
        int status = 0;
        if (pid == 0) {
                dup2(devnull, 1);
                execl("/bin/gzip", "/bin/gzip","-h", (char*)0);
        } else {
                waitpid(pid, &status, 0);
        }
}

int main() {
        int x=0;
        int devnull = open("/dev/null", O_WRONLY);
        for(;x<1000;x++) malloc(100000);
        for(x=0;x<1000;x++) run(devnull);
        return 0;
}

And then time the execution again:

nils@box:~/bench$ gcc -o test2 test2.c
nils@box:~/bench$ time ./test2

real    0m1.962s
user    0m0.078s
sys     0m1.914s

This result suggests that complex parent processes seem to slow the performance of fork() considerably.

What are the alternatives?

Searching around for more performant alternatives to the fork/execve pattern lead us to posix_spawn, which supports starting processes using vfork (by defining the POSIX_SPAWN_USEVFORK attribute) instead of fork. The vfork man page states:

vfork() is a special case of clone(2).  It is used to create new
       processes without copying the page tables of the parent process.  It
       may be useful in performance-sensitive applications where a child is
       created which then immediately issues an execve(2).

This sounds like exactly what we want, so let’s give a try. We’ll show the Python code first, which uses a fairly dirty hack to expose the posix_spawn function to the Python code using ctypes:

import ctypes
import os

class PosixSpawn():
        def __init__(self):
                self.libc = ctypes.cdll.LoadLibrary("libc.so.6")
                self._posix_spawn = self.libc.posix_spawn
                self._posix_spawn.restype = ctypes.c_int
                self._posix_spawn.argtypes = (
                        ctypes.POINTER(ctypes.c_int),
                        ctypes.c_char_p, ctypes.c_void_p, ctypes.c_void_p,
                        ctypes.POINTER(ctypes.c_char_p),
                        ctypes.POINTER(ctypes.c_char_p)
                )
                # dirty hack: hardcoded struct sizes
                self.attrs = self.libc.malloc(336)
                self.actions = self.libc.malloc(80)
                self.devnull = open("/dev/null","wb")
                self.env = [x+"="+os.environ[x] for x in os.environ] + [ 0 ]

        def execute(self, exe, args):
                pid = ctypes.c_int()
                args = [exe] + args + [ 0 ]
                argv = (ctypes.c_char_p * 5) (*args)
                env = (ctypes.c_char_p * ( len(self.env) ))(*self.env)
                self.libc.posix_spawnattr_init(self.attrs)
                self.libc.posix_spawnattr_setflags(self.attrs, 0x40)
                self.libc.posix_spawn_file_actions_init(self.actions)
                self.libc.posix_spawn_file_actions_adddup2(self.actions, self.devnull.fileno(), 1)
                self._posix_spawn(ctypes.byref(pid), ctypes.c_char_p(exe),
                        self.actions, self.attrs,
                        ctypes.cast(argv, ctypes.POINTER(ctypes.c_char_p)),
                        ctypes.cast(env, ctypes.POINTER(ctypes.c_char_p)))
                status = ctypes.c_int()
                self.libc.waitpid(pid.value, ctypes.byref(status), 0)

if __name__ == "__main__":
        ps = PosixSpawn()
        exe = "/bin/gzip"
        for x in range(0,1000):
                ps.execute(exe, ["-h"])

And we time the execution:

nils@box:~/bench$ time python test3.py

real    0m1.133s
user    0m0.267s
sys     0m0.912s

882 testcases per second, a big improvement over the Python fork() version and almost as fast as the native C fork() code. Let’s see what a native C version using posix_spawn can achieve:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#define _USE_GNU
#include <spawn.h>

extern char **environ;

#ifndef POSIX_SPAWN_USEVFORK
#define POSIX_SPAWN_USEVFORK 0x40
#endif

void spawnrun(int devnull) {
        char *argv[] = {"/bin/gzip", "-h",(char *) 0};
        pid_t processID;
        int status = -1;
        posix_spawnattr_t spattr;
        posix_spawn_file_actions_t action;
        posix_spawnattr_init(&spattr);
        posix_spawn_file_actions_init(&action);
        posix_spawn_file_actions_adddup2(&action, devnull, 1);
        posix_spawnattr_setflags(&spattr, POSIX_SPAWN_USEVFORK);
        status = posix_spawn(&processID,"/bin/gzip",&action,&spattr,argv,environ);
        posix_spawn_file_actions_destroy(&action);
        posix_spawnattr_destroy(&spattr);
        waitpid(processID, &status, 0);
}

int main() {
        int x=0;
        int devnull = open("/dev/null", O_WRONLY);
        for(;x<1000;x++) spawnrun(devnull);
        return 0;
}

And once again we time the execution:

nils@box:~/bench$ gcc -o test3 test3.c
nils@box:~/bench$ time ./test3

real    0m0.918s
user    0m0.067s
sys     0m0.856s

As expected, this is even faster than our latest Python version. However, the difference is not nearly as significant as in our previous versions.

Conclusion

By investigating the reason for the relatively slow execution of binaries in Python, we were able to more than double the number of executed testcases per second for small binaries using posix_spawn with vfork. While there are advantages in programming in high level languages like Python performance will often suffer. By understanding the internals and being able to optimise the hot code paths it is possible to gain considerable speed-ups while still benefiting from the rapid development times provided by high-level languages.