xorl %eax, %eax

Linux kernel user to kernel space range checks


So, I was recently looking at some Linux kernel code that, as it turns out, isn’t buggy. I was tricked by the absence of the classic access_ok() check before the actual read operation from userland. That’s why I decided to write this post explaining the range checks behind the user to kernel space copy (read/write) operations of the Linux kernel.
For convenience I’ll stick with the x86 architecture in this post since it is the most widely deployed one. As you probably already know, if there is no check on a user-supplied pointer, the result can be a pretty bad vulnerability: a user could pass a pointer into kernel space and consequently force the kernel into performing a read or write at an arbitrary kernel location.
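
To make the danger concrete, here is a minimal, hypothetical (and deliberately buggy) ioctl handler; everything in it is made up for illustration:

static long buggy_ioctl(struct file *file, unsigned int cmd,
                        unsigned long arg)
{
        /* BUG: 'arg' is fully user-controlled and dereferenced with no
         * check, so userland can pass a kernel address and have the
         * kernel read from it -- an arbitrary kernel-memory read. */
        int val = *(int *)arg;

        return val ? 0 : -EINVAL;
}

The checked interfaces discussed below (access_ok(), get_user(), copy_from_user() and friends) exist precisely to close this hole.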

access_ok()
The most common check to avoid this is the access_ok() macro which can be found at /arch/x86/include/asm/uaccess.h for the x86 architecture.

/**
 * access_ok: - Checks if a user space pointer is valid
 * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE.  Note that
 *        %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe
 *        to write to a block, it is always safe to read from it.
 * @addr: User space pointer to start of block to check
 * @size: Size of block to check
 *
 * Context: User context only.  This function may sleep.
 *
 * Checks if a pointer to a block of memory in user space is valid.
 *
 * Returns true (nonzero) if the memory block may be valid, false (zero)
 * if it is definitely invalid.
 *
 * Note that, depending on architecture, this function probably just
 * checks that the pointer is in the user space range - after calling
 * this function, memory access functions may still return -EFAULT.
 */
#define access_ok(type, addr, size) (likely(__range_not_ok(addr, size) == 0))

You can read the comment above for an introduction to this useful C macro. The likely() macro is a branch-prediction hint used for efficiency and, as you can see, access_ok() simply calls __range_not_ok(), which can be found in the same source code file.
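
For reference, likely() and its sibling unlikely() are thin wrappers around GCC’s __builtin_expect() and are defined in include/linux/compiler.h:

#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)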

/*
 * Test whether a block of memory is a valid user space address.
 * Returns 0 if the range is valid, nonzero otherwise.
 *
 * This is equivalent to the following test:
 * (u33)addr + (u33)size >= (u33)current->addr_limit.seg (u65 for x86_64)
 *
 * This needs 33-bit (65-bit for x86_64) arithmetic. We have a carry...
 */

#define __range_not_ok(addr, size)                                      \
({                                                                      \
        unsigned long flag, roksum;                                     \
        __chk_user_ptr(addr);                                           \
        asm("add %3,%1 ; sbb %0,%0 ; cmp %1,%4 ; sbb $0,%0"             \
            : "=&r" (flag), "=r" (roksum)                               \
            : "1" (addr), "g" ((long)(size)),                           \
              "rm" (current_thread_info()->addr_limit.seg));            \
        flag;                                                           \
})

To begin with, __chk_user_ptr() is a compile-time check that is normally a no-op; it only does something when the code is analyzed with the Sparse checker (__CHECKER__), as we can read at include/linux/compiler.h.

#ifdef __CHECKER__
  ...
extern void __chk_user_ptr(const volatile void __user *);
  ...
#else
  ...
# define __chk_user_ptr(x) (void)0
  ...
#endif
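
When Sparse is actually running (__CHECKER__ defined), the checking power comes from the __user annotation itself, which in the same header is defined as:

# define __user __attribute__((noderef, address_space(1)))

That marks the pointer as belonging to address space 1 and as non-dereferenceable, so Sparse flags any direct dereference. A hypothetical example of what it catches:

static int demo(int __user *uptr)
{
        int v = *uptr;          /* sparse warns: dereference of noderef expression */

        if (get_user(v, uptr))  /* OK: goes through the checked accessors */
                return -EFAULT;
        return v;
}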

The rest of the code is simple GCC inline assembly that basically does the following:

add size, addr                                  ; compute the end of the requested range (addr + size)
sbb flag, flag                                  ; flag = -CF: -1 if the addition wrapped around, 0 otherwise
cmp addr, current_thread_info()->addr_limit.seg ; compare the thread's address limit with addr + size
sbb 0, flag                                     ; subtract the carry flag (CF) set by the comparison from
                                                ; 'flag'; flag stays 0 only if both checks passed
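
For clarity, here is a C rendering of the same test; a minimal sketch that mirrors the assembly (the 64-bit sum plays the role of the 33-bit arithmetic from the kernel comment, so a wrap of addr + size cannot sneak below the limit):

/* Sketch: C equivalent of __range_not_ok() on 32-bit x86. */
static inline int range_not_ok_c(unsigned long addr, unsigned long size,
                                 unsigned long limit)
{
        unsigned long long end = (unsigned long long)addr + size;

        return end > limit;     /* nonzero: range is not OK */
}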

So, this is the range check that access_ok() performs: it verifies that the given range (address + size) lies within the current thread’s address limit. For completeness, here is current_thread_info(), defined in the arch/x86/include/asm/thread_info.h header file.

/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __used;

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

That takes the value of the ESP register and masks it with ~(THREAD_SIZE - 1), rounding it down to the base of the current thread’s kernel stack, which is where the thread_info structure lives. For example, with a THREAD_SIZE of 8 KB, an ESP of 0xc12347f8 yields 0xc1234000. In the same C header file we can find the ‘thread_info’ structure which is this:

struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */
        __u32                   status;         /* thread synchronous flags */
        __u32                   cpu;            /* current CPU */
        int                     preempt_count;  /* 0 => preemptable,
                                                   <0 => BUG */
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
#ifdef CONFIG_X86_32
        unsigned long           previous_esp;   /* ESP of the previous stack in
                                                   case of nested (IRQ) stacks
                                                */
        __u8                    supervisor_stack[0];
#endif
        int                     uaccess_err;
};

And ‘addr_limit’, the field holding our thread’s address limit, is of type ‘mm_segment_t’ which is defined at arch/x86/include/asm/processor.h.

typedef struct {
        unsigned long           seg;
} mm_segment_t;
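
Note that ‘addr_limit’ isn’t a constant: the kernel can temporarily widen it through set_fs(), which simply writes current_thread_info()->addr_limit, the very value __range_not_ok() compares against. A sketch of the classic pattern from that era’s kernel code:

mm_segment_t old_fs = get_fs(); /* save the current address limit */

set_fs(KERNEL_DS);              /* seg becomes -1UL: every address passes */
/* ... code that may call copy_from_user() on a kernel buffer ... */
set_fs(old_fs);                 /* always restore the saved limit */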

Before moving on to the next one, it’s important to note here that access_ok() on the x86 architecture completely ignores the type of access to be performed (VERIFY_READ or VERIFY_WRITE). It only checks the range of the given address.

get_user()/put_user()

This is the common interface for reading or writing single variables from/to user space. Since the aim of this post is the range checks, I’m just going to cover get_user(); the same range checks are employed by put_user() too. Its code resides in the arch/x86/include/asm/uaccess.h header file for the x86 architecture that we’re looking at.

#define get_user(x, ptr)                                                \
({                                                                      \
        int __ret_gu;                                                   \
        unsigned long __val_gu;                                         \
        __chk_user_ptr(ptr);                                            \
        might_fault();                                                  \
        switch (sizeof(*(ptr))) {                                       \
        case 1:                                                         \
                __get_user_x(1, __ret_gu, __val_gu, ptr);               \
                break;                                                  \
        case 2:                                                         \
                __get_user_x(2, __ret_gu, __val_gu, ptr);               \
                break;                                                  \
        case 4:                                                         \
                __get_user_x(4, __ret_gu, __val_gu, ptr);               \
                break;                                                  \
        case 8:                                                         \
                __get_user_8(__ret_gu, __val_gu, ptr);                  \
                break;                                                  \
        default:                                                        \
                __get_user_x(X, __ret_gu, __val_gu, ptr);               \
                break;                                                  \
        }                                                               \
        (x) = (__typeof__(*(ptr)))__val_gu;                             \
        __ret_gu;                                                       \
})
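
As a usage note, callers don’t need their own access_ok() before get_user(); as we’ll see, the range check happens inside. A hypothetical consumer:

static long demo_ioctl(struct file *file, unsigned int cmd,
                       unsigned long arg)
{
        int val;

        /* get_user() returns 0 on success and -EFAULT on a bad pointer */
        if (get_user(val, (int __user *)arg))
                return -EFAULT;

        return val > 0 ? 0 : -EINVAL;
}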

We have already talked about the __chk_user_ptr() macro, so we can move on to might_fault(), which can be found at include/linux/kernel.h.

#ifdef CONFIG_PROVE_LOCKING
void might_fault(void);
#else
static inline void might_fault(void)
{
        might_sleep();
}
#endif

Unless locking debugging is enabled through ‘CONFIG_PROVE_LOCKING‘, this executes the might_sleep() macro, also available in the same source code file.

#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
  void __might_sleep(const char *file, int line, int preempt_offset);
/**
 * might_sleep - annotation for functions that can sleep
 *
 * this macro will print a stack trace if it is executed in an atomic
 * context (spinlock, irq-handler, ...).
 *
 * This is a useful debugging help to be able to catch problems early and not
 * be bitten later when the calling function happens to sleep when it is not
 * supposed to.
 */
# define might_sleep() \
        do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
#else
  static inline void __might_sleep(const char *file, int line,
                                   int preempt_offset) { }
# define might_sleep() do { might_resched(); } while (0)
#endif

In the case of a kernel with the ‘CONFIG_DEBUG_SPINLOCK_SLEEP‘ option enabled, __might_sleep() and might_resched() will be invoked. Otherwise, only might_resched() is called.

void __might_sleep(const char *file, int line, int preempt_offset)
{
#ifdef in_atomic
        static unsigned long prev_jiffy;        /* ratelimiting */

        if ((preempt_count_equals(preempt_offset) && !irqs_disabled()) ||
            system_state != SYSTEM_RUNNING || oops_in_progress)
                return;
        if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
                return;
        prev_jiffy = jiffies;

        printk(KERN_ERR
                "BUG: sleeping function called from invalid context at %s:%d\n",
                        file, line);
        printk(KERN_ERR
                "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
                        in_atomic(), irqs_disabled(),
                        current->pid, current->comm);

        debug_show_held_locks(current);
        if (irqs_disabled())
                print_irqtrace_events(current);
        dump_stack();
#endif
}
EXPORT_SYMBOL(__might_sleep);

The whole routine is only compiled in when ‘in_atomic’ is defined, the macro that tells whether we are running in an atomic context, as we can read in the include/linux/hardirq.h header file. The function returns immediately if the preempt count equals the expected preempt offset and IRQs are not disabled (meaning the context is actually allowed to sleep), and likewise if the system isn’t in the ‘SYSTEM_RUNNING’ state yet or a kernel OOPS is already in progress. It then rate-limits itself through ‘prev_jiffy’ so the report is printed at most once per second, and finally prints the ‘sleeping function called from invalid context’ message along with the held locks and a stack trace.
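
This is exactly the class of bug the annotation catches. A hypothetical offender (lock and variable names made up):

spin_lock(&demo_lock);          /* enter atomic context */
if (get_user(val, uptr))        /* might_fault() -> might_sleep(): triggers
                                   the report above on a debug kernel */
        ret = -EFAULT;
spin_unlock(&demo_lock);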
The other macro that will likely be invoked is might_resched(), which resides in the include/linux/kernel.h header file.

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

If ‘CONFIG_PREEMPT_VOLUNTARY’ isn’t defined, it does nothing. If it is defined, it calls _cond_resched(), which is part of the kernel/sched.c file.

static inline int should_resched(void)
{
        return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
}

static void __cond_resched(void)
{
        add_preempt_count(PREEMPT_ACTIVE);
        schedule();
        sub_preempt_count(PREEMPT_ACTIVE);
}

int __sched _cond_resched(void)
{
        if (should_resched()) {
                __cond_resched();
                return 1;
        }
        return 0;
}
EXPORT_SYMBOL(_cond_resched);

If rescheduling is needed (and PREEMPT_ACTIVE isn’t already set in the preempt count), it calls __cond_resched(), which raises the preempt count by PREEMPT_ACTIVE using add_preempt_count(), schedules the task through schedule(), and finally lowers the count again with sub_preempt_count().
We won’t go any deeper into the sleeping and scheduling internals since they’re out of the scope of this post. Back in get_user(), you can see that depending on the size of the pointed-to type it invokes __get_user_x() with 1, 2 or 4 as its first argument, or __get_user_8() for 8-byte values; for any other size the literal ‘X’ is passed, which expands to a call to the non-existent __get_user_X and therefore fails at link time. This moves us to arch/x86/include/asm/uaccess.h where this code resides.

extern int __get_user_1(void);
extern int __get_user_2(void);
extern int __get_user_4(void);
extern int __get_user_8(void);
extern int __get_user_bad(void);

#define __get_user_x(size, ret, x, ptr)               \
        asm volatile("call __get_user_" #size         \
                     : "=a" (ret), "=d" (x)           \
                     : "0" (ptr))                     \

As you can read here, this inline assembly simply calls __get_user_{1,2,4,8} based on the size determined above, passing the user pointer in EAX and receiving the value back in EDX. Those functions are available at arch/x86/lib/getuser.S and since we only care about the range check I will only discuss __get_user_1() which is this:

        .text
ENTRY(__get_user_1)
        CFI_STARTPROC
        GET_THREAD_INFO(%_ASM_DX)
        cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
        jae bad_get_user
1:      movzb (%_ASM_AX),%edx
        xor %eax,%eax
        ret
        CFI_ENDPROC
ENDPROC(__get_user_1)

‘CFI_STARTPROC’ emits the DWARF-2 ‘.cfi_startproc’ debugging directive when ‘CONFIG_AS_CFI’ is enabled. The next call, GET_THREAD_INFO(), loads the address of the current thread_info structure into the DX register, as we can read at arch/x86/include/asm/thread_info.h.

/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg)     \
        movl $-THREAD_SIZE, reg; \
        andl %esp, reg

That moves the mask -THREAD_SIZE into ‘reg’ and ANDs it with ESP, rounding the stack pointer down to the base of the thread’s stack, just like the C version of current_thread_info() shown earlier.
As you can see, we then have a compare instruction that checks the user address in ‘%_ASM_AX’ against the thread’s address limit, loaded from offset ‘TI_addr_limit’ into the thread_info structure whose address we just placed in ‘%_ASM_DX’. The latter symbol is defined at arch/x86/kernel/asm-offsets_32.c like this:

/* workaround for a warning with -Wmissing-prototypes */
void foo(void);

void foo(void)
{
      ...
        OFFSET(TI_addr_limit, thread_info, addr_limit);
      ...
}

And the OFFSET() macro is defined like this (the x86 build picks it up from include/linux/kbuild.h; the copy below, which also carries DEFINE_LONGS(), is the equivalent at arch/um/sys-i386/user-offsets.c):

#define DEFINE(sym, val) \
        asm volatile("\n->" #sym " %0 " #val : : "i" (val))

#define DEFINE_LONGS(sym, val) \
        asm volatile("\n->" #sym " %0 " #val : : "i" (val/sizeof(unsigned long)))

#define OFFSET(sym, str, mem) \
        DEFINE(sym, offsetof(struct str, mem));

This defines the symbol ‘TI_addr_limit’ as the byte offset of the ‘addr_limit’ member within the ‘thread_info’ structure. Basically, ‘TI_addr_limit(%_ASM_DX)’ is the assembly counterpart of the current_thread_info()->addr_limit.seg access which was discussed earlier.
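
During the build, the asm-offsets C file is compiled to assembly and the ‘->sym val’ markers are converted by a sed script into preprocessor definitions in the generated asm-offsets.h header, which is what getuser.S ultimately picks up. The end result looks roughly like this (the numeric offset is illustrative and depends on the kernel configuration):

#define TI_addr_limit 24 /* offsetof(struct thread_info, addr_limit) */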
So, if the address is valid, the ‘jae’ (Jump if Above or Equal) branch is not taken and ‘movzb’ loads a single byte from the user-provided pointer, zero-extended, into the EDX general purpose register; EAX (which holds the function’s return value) is then zeroed and the function returns. However, if the user address is above or equal to the thread’s limit, execution jumps to bad_get_user:

bad_get_user:
        CFI_STARTPROC
        xor %edx,%edx
        mov $(-EFAULT),%_ASM_AX
        ret
        CFI_ENDPROC
END(bad_get_user)

That clears the destination register EDX, sets the return value in EAX to ‘-EFAULT’ and returns. That’s pretty much the range check inside get_user()/put_user().

copy_to_user()/copy_from_user()
These are the most common functions for reading and writing arbitrary amounts of data to and from user space. Once again, since our goal is to understand the range checks of these routines, I’ll just use copy_from_user() since the same range check applies to the copy_to_user() routine too. The code of that function is located in the arch/x86/include/asm/uaccess_32.h header file.

static inline unsigned long __must_check copy_from_user(void *to,
                                          const void __user *from,
                                          unsigned long n)
{
        int sz = __compiletime_object_size(to);

        if (likely(sz == -1 || sz >= n))
                n = _copy_from_user(to, from, n);
        else
                copy_from_user_overflow();

        return n;
}

As you can read, initially it will call __compiletime_object_size() on the kernel space pointer ‘to’. This C macro is defined at include/linux/compiler.h like this:

/* Compile time object size, -1 for unknown */
#ifndef __compiletime_object_size
# define __compiletime_object_size(obj) -1
#endif
#ifndef __compiletime_warning
# define __compiletime_warning(message)
#endif
#ifndef __compiletime_error
# define __compiletime_error(message)
#endif

And the actual code of those macros is at include/linux/compiler-gcc4.h…

#if __GNUC_MINOR__ > 0
#define __compiletime_object_size(obj) __builtin_object_size(obj, 0)
#endif
#if __GNUC_MINOR__ >= 4
#define __compiletime_warning(message) __attribute__((warning(message)))
#define __compiletime_error(message) __attribute__((error(message)))
#endif
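
A quick, hypothetical sketch of what __builtin_object_size() reports:

static void demo(void *unknown)
{
        char buf[16];

        size_t a = __builtin_object_size(buf, 0);     /* 16: known size */
        size_t b = __builtin_object_size(buf + 4, 0); /* 12: bytes to end */
        size_t c = __builtin_object_size(unknown, 0); /* (size_t)-1: unknown */
}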

So, when compiled with GCC 4.1 or newer, the __builtin_object_size() built-in is used to determine the destination object’s size at compile time, or -1 when it cannot be determined. Back in copy_from_user(), the check therefore passes when the object’s size is unknown (-1) or at least as large as the amount of data ‘n’ to be copied. In that case it moves to the next level and calls _copy_from_user(). Otherwise, it calls copy_from_user_overflow() from arch/x86/lib/usercopy_32.c.

void copy_from_user_overflow(void)
{
        WARN(1, "Buffer overflow detected!\n");
}
EXPORT_SYMBOL(copy_from_user_overflow);

That prints a warning and returns. Assuming that it passed the checks, here is the code that will be executed:

unsigned long
_copy_from_user(void *to, const void __user *from, unsigned long n)
{
        if (access_ok(VERIFY_READ, from, n))
                n = __copy_from_user(to, from, n);
        else
                memset(to, 0, n);
        return n;
}
EXPORT_SYMBOL(_copy_from_user);

It uses access_ok() to check the user space pointer ‘from’ for ‘n’ bytes and, if the range is within the address limit, it continues to the actual copy through __copy_from_user(). If the check fails, it just zeroes out the kernel memory pointed to by ‘to’ and returns.
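
Putting it all together, a typical consumer just relies on copy_from_user() to do the range check and propagates -EFAULT on failure. A hypothetical handler (structure and names made up for illustration):

struct demo_req {               /* hypothetical request structure */
        __u32 flags;
        __u64 addr;
};

static long demo_handler(void __user *uarg)
{
        struct demo_req req;

        /* copy_from_user() performs the access_ok() range check internally
         * and returns the number of bytes it could NOT copy. */
        if (copy_from_user(&req, uarg, sizeof(req)))
                return -EFAULT;

        /* ... validate and use req ... */
        return 0;
}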

There is a person who I should thank here but I don’t know if he wants to be referenced. :)

Written by xorl

October 25, 2010 at 04:06

Posted in linux, security
