
CVE-2009-0024: Linux kernel remap_file_pages(2) Race Condition


This issue was reported by Nelson Elhage and affects Linux kernels prior to 2.6.24.1. The code presented here is taken from the 2.6.24 release of the Linux kernel. The vulnerable code is part of mm/fremap.c, specifically this system call:

100/**
101 * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma
102 * @start: start of the remapped virtual memory range
103 * @size: size of the remapped virtual memory range
104 * @prot: new protection bits of the range (see NOTE)
105 * @pgoff: to-be-mapped page of the backing store file
106 * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO.
107 *
108 * sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma
109 * (shared backing store file).
110 *
111 * This syscall works purely via pagetables, so it's the most efficient
112 * way to map the same (large) file into a given virtual window. Unlike
113 * mmap()/mremap() it does not create any new vmas. The new mappings are
114 * also safe across swapout.
115 *
116 * NOTE: the 'prot' parameter right now is ignored (but must be zero),
117 * and the vma's default protection is used. Arbitrary protections
118 * might be implemented in the future.
119 */
120 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
121        unsigned long prot, unsigned long pgoff, unsigned long flags)
122 {


The provided comments are the best way to describe this function’s operation. Assuming you have read them and understood what the function does, we can continue. I will skip everything apart from the vulnerability in this routine. Here it is:

123        struct mm_struct *mm = current->mm;
124        struct address_space *mapping;
125        unsigned long end = start + size;
126        struct vm_area_struct *vma;
127        int err = -EINVAL;
128        int has_write_lock = 0;
         ...
132        /*
133         * Sanitize the syscall parameters:
134         */
135        start = start & PAGE_MASK;
136        size = size & PAGE_MASK;
         ...
151        vma = find_vma(mm, start);
152
153        /*
154         * Make sure the vma is shared, that it supports prefaulting,
155         * and that the remapped range is valid and fully within
156         * the single existing vma.  vm_private_data is used as a
157         * swapout cursor in a VM_NONLINEAR vma.
158         */
         ...
185                mapping = vma->vm_file->f_mapping;
186                /*
187                 * page_mkclean doesn't work on nonlinear vmas, so if
188                 * dirty pages need to be accounted, emulate with linear
189                 * vmas.
190                 */
191                if (mapping_cap_account_dirty(mapping)) {
192                        unsigned long addr;
193
194                        flags &= MAP_NONBLOCK;
195                        addr = mmap_region(vma->vm_file, start, size,
196                                        flags, vma->vm_flags, pgoff, 1);
         ...
235        return err;
236 }


Have a look at vma. It’s a pointer to a vm_area_struct structure, which is defined in include/linux/mm_types.h. Here are a few noteworthy members:

99 struct vm_area_struct {
100        struct mm_struct * vm_mm;       /* The address space we belong to. */
101        unsigned long vm_start;         /* Our start address within vm_mm. */
         ...
141        /* Information about our backing store: */
142        unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
143                                           units, *not* PAGE_CACHE_SIZE */
144        struct file * vm_file;          /* File we map to (can be NULL). */
145        void * vm_private_data;         /* was vm_pte (shared mem) */
         ...
154 };


From now on, we only care about vm_file. Moving back to sys_remap_file_pages(), we can see that it initializes the vma pointer at line 151 using the find_vma() routine. This function is part of mm/mmap.c and is used to find the first VMA that satisfies addr < vm_end; a conceptual sketch of that contract is given below. Next, the syscall performs numerous checks to ensure the conditions described in the comment at lines 153-157 and, if the mapping requires dirty-page accounting as the comment at lines 186-190 explains, it invokes mmap_region() at line 195.
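Here is a conceptual sketch of find_vma()'s contract, using cut-down stand-in types (the real 2.6.24 implementation walks an rb-tree and caches the last hit in mm->mmap_cache, but the result is the same). Note that the returned VMA does not necessarily contain addr; it may start entirely above it:

#include <stddef.h>

/* Stand-in types for illustration only. */
struct vm_area_struct {
        unsigned long vm_start, vm_end;
        struct vm_area_struct *vm_next;      /* next VMA in address order */
};

struct mm_struct {
        struct vm_area_struct *mmap;         /* head of the sorted VMA list */
};

/* Return the first VMA, in address order, whose vm_end lies above addr. */
static struct vm_area_struct *find_vma_sketch(struct mm_struct *mm,
                                              unsigned long addr)
{
        struct vm_area_struct *vma;

        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (addr < vma->vm_end)
                        return vma;
        return NULL;        /* addr lies above every existing mapping */
}


mmap_region() is also part of mm/mmap.c and it is used to mmap() the requested region. However, it may reach the following code: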

1067 unsigned long mmap_region(struct file *file, unsigned long addr,
1068                           unsigned long len, unsigned long flags,
1069                           unsigned int vm_flags, unsigned long pgoff,
1070                           int accountable)
1071 {
1072        struct mm_struct *mm = current->mm;
        ...
1078        struct inode *inode =  file ? file->f_path.dentry->d_inode : NULL;
        ...
1084        if (vma && vma->vm_start < addr + len) {
1085                if (do_munmap(mm, addr, len))
1086                        return -ENOMEM;
1087                goto munmap_back;
1088        }
        ...
1219        return error;
1220 }

Since sys_remap_file_pages() passes vma->vm_file to mmap_region() without taking its own reference on the file, mmap_region() may fall into this condition and un-map that very VMA. The do_munmap() at line 1085 drops the VMA's reference on vm_file, and if that was the last reference the struct file is freed while mmap_region() keeps using the pointer it was handed. The rest of mmap_region() then operates on freed memory that an attacker can try to reallocate and control, which can potentially result in code execution.
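To make this concrete, here is a minimal user-space sketch of driving that code path (the file name is hypothetical; it just needs to be a regular file on a normal disk filesystem so that mapping_cap_account_dirty() returns true). The important detail is the close(): once the descriptor is gone, the shared mapping's vm_file holds the only reference to the struct file, so the do_munmap() performed inside mmap_region() can drop that last reference:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);

        /* Hypothetical backing file, two pages long. */
        int fd = open("/tmp/testfile", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 2 * page) != 0) { perror("ftruncate"); return 1; }

        /* remap_file_pages() only operates on shared file mappings. */
        char *win = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (win == MAP_FAILED) { perror("mmap"); return 1; }

        /* After this, the VMA's vm_file holds the only reference. */
        close(fd);

        /* Remap the first page of the window to the second page of the
         * file; prot must be 0, as the kernel comment above explains.
         * On a vulnerable kernel this reaches the mmap_region() call. */
        if (remap_file_pages(win, page, 0, 1, 0) != 0)
                perror("remap_file_pages");

        return 0;
}

To fix this, Oleg Nesterov committed the following patch: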

                if (mapping_cap_account_dirty(mapping)) {
                        unsigned long addr;
+                       struct file *file = vma->vm_file;

                        flags &= MAP_NONBLOCK;
-                       addr = mmap_region(vma->vm_file, start, size,
+                       get_file(file);
+                       addr = mmap_region(file, start, size,
                                        flags, vma->vm_flags, pgoff, 1);
+                       fput(file);
                        if (IS_ERR_VALUE(addr)) {


First, the patch stores vma->vm_file in a local pointer. Then, instead of passing vma->vm_file straight to mmap_region(), it grabs a reference on the file using the get_file() macro from include/linux/fs.h, which simply increments the file's reference count:

816 #define get_file(x)     atomic_inc(&(x)->f_count)

And atomic_inc() is defined in include/asm-x86/atomic_32.h as:

89 /**
90  * atomic_inc - increment atomic variable
91  * @v: pointer of type atomic_t
92  * 
93  * Atomically increments @v by 1.
94  */ 
95 static __inline__ void atomic_inc(atomic_t *v)
96 {
97        __asm__ __volatile__(
98                LOCK_PREFIX "incl %0"
99                :"+m" (v->counter));
100 }


With that reference held, the call to mmap_region() is performed safely: even if do_munmap() removes the VMA and drops its reference, the struct file cannot be freed underneath us. Finally, the patch invokes fput(), a function in fs/file_table.c that decrements the reference count and releases the file once it reaches zero:

200 void fastcall fput(struct file *file)
201 {
202        if (atomic_dec_and_test(&file->f_count))
203                __fput(file);
204 }
205
206 EXPORT_SYMBOL(fput);

The atomic_dec_and_test() routine it uses is likewise found in the architecture-specific atomic operations header:

115 /**
116  * atomic_dec_and_test - decrement and test
117  * @v: pointer of type atomic_t
118  * 
119  * Atomically decrements @v by 1 and
120  * returns true if the result is 0, or false for all other
121  * cases.
122  */ 
123 static __inline__ int atomic_dec_and_test(atomic_t *v)
124 {
125        unsigned char c;
126
127        __asm__ __volatile__(
128                LOCK_PREFIX "decl %0; sete %1"
129                :"+m" (v->counter), "=qm" (c)
130                : : "memory");
131        return c != 0;
132 }

I know this is not strictly required since most people already know it, but for completeness here is LOCK_PREFIX as defined in include/asm-x86/alternative_32.h:

106/*
107 * Alternative inline assembly for SMP.
108 *
109 * The LOCK_PREFIX macro defined here replaces the LOCK and
110 * LOCK_PREFIX macros used everywhere in the source tree.
111 *
112 * SMP alternatives use the same data structures as the other
113 * alternatives and the X86_FEATURE_UP flag to indicate the case of a
114 * UP system running a SMP kernel.  The existing apply_alternatives()
115 * works fine for patching a SMP kernel for UP.
116 *
117 * The SMP alternative tables can be kept after boot and contain both
118 * UP and SMP versions of the instructions to allow switching back to
119 * SMP at runtime, when hotplugging in a new CPU, which is especially
120 * useful in virtualized environments.
121 *
122 * The very common lock prefix is handled as special case in a
123 * separate table which is a pure address list without replacement ptr
124 * and size information.  That keeps the table sizes small.
125 */
126
127 #ifdef CONFIG_SMP
128 #define LOCK_PREFIX \
129                ".section .smp_locks,\"a\"\n"   \
130                "  .align 4\n"                  \
131                "  .long 661f\n" /* address */  \
132                ".previous\n"                   \
133                "661:\n\tlock; "
134
135 #else /* ! CONFIG_SMP */
136 #define LOCK_PREFIX ""
137 #endif
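
Putting the pieces together, the fix is the classic take-a-reference / call / drop-the-reference pattern. Purely as an illustration, here is a user-space analogue using file descriptors instead of struct file pointers, where dup() plays the role of get_file() (both end up bumping the open file's reference count) and close() plays the role of fput():

#include <stdio.h>
#include <unistd.h>

/* Stand-in for mmap_region()/do_munmap(): may drop the reference
 * the caller was relying on. */
static void might_drop(int fd)
{
        close(fd);
}

static void safe_call(int fd)
{
        int myref = dup(fd);            /* analogue of get_file(file) */
        if (myref < 0) { perror("dup"); return; }

        might_drop(fd);                 /* may drop the caller's reference... */

        /* ...but the underlying file stays alive through myref. */
        if (write(myref, "still here\n", 11) < 0)
                perror("write");

        close(myref);                   /* analogue of fput(file) */
}

int main(void)
{
        safe_call(STDOUT_FILENO);
        return 0;
}

The callee is free to drop the reference it was handed; the caller's own reference keeps the object alive until it is explicitly released.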

Of course, exploiting this bug is not trivial, but as the sketch above shows, the vulnerable code is easy to reach since it sits directly behind a system call.

Written by xorl

April 7, 2009 at 12:20

Posted in bugs, linux
