xorl %eax, %eax

CVE-2009-4537: Linux kernel RTL8169 Remote Memory Corruption


This is probably the funniest bug presented by fabs in his awesome “cat /proc/sys/net/ipv4/fuckups” presentation at the 26th CCC. In 2005, Francois Romieu submitted a patch for this NIC device driver in the Linux kernel, which you can find here. The patch was quite simple, as you can see here:

-       /* For gigabit rtl8169, MTU + header + CRC + VLAN */
-       RTL_W16(RxMaxSize, tp->rx_buf_sz);
+       /* Low hurts. Let's disable the filtering. */
+       RTL_W16(RxMaxSize, 16383);

He used a hardcoded constant value in order to disable the hardware-based RX size filtering because, as you can read in his comments:

- disable hardware Rx size filtering: so far it only proved to be able
  to trigger some new fancy errors;

Understandable, but let’s see what happened in 2009. In January 2009 a new remote memory corruption bug was disclosed for this device driver. You can read my analysis of that one here. It can also be found under the name CVE-2009-1389.
Now, here comes the funny part… Among other things, the patch that fixed this vulnerability included the following:

 /* Low hurts. Let's disable the filtering. */
-	RTL_W16(RxMaxSize, 16383);
+	RTL_W16(RxMaxSize, rx_buf_sz);
 }

Isn’t this just ironic?
They re-enabled the hardware-based RX size filtering and thus the “fancy errors” that Francois Romieu was talking about in 2005. fabs noticed this and was able to trigger those fancy errors when the received frames were exactly the same size as ‘RxMaxSize’ (which by default is 1532/1533 bytes). As we can read in his slides (slide 81):

On receipt of the frame, the device reports
that several fragments of over 8000 bytes
have been received.

But the important part was that, as he discovered, the RX buffers still contain the old frames’ payload and consequently you have somewhat complete control over some values. Now, if we move to drivers/net/r8169.c we can read the following:

static int rtl8169_rx_interrupt(struct net_device *dev,
                                struct rtl8169_private *tp,
                                void __iomem *ioaddr, u32 budget)
{
        unsigned int cur_rx, rx_left;
        unsigned int delta, count;
      ...
                if (status & DescOwn)
                        break;
                if (unlikely(status & RxRES)) {
      ...
                        if (status & RxFOVF) {
                                rtl8169_schedule_work(dev, rtl8169_reset_task);
                                dev->stats.rx_fifo_errors++;
                        }
                        rtl8169_mark_to_asic(desc, tp->rx_buf_sz);
                } else {
                        struct sk_buff *skb = tp->Rx_skbuff[entry];
                        dma_addr_t addr = le64_to_cpu(desc->addr);
                        int pkt_size = (status & 0x00001FFF) - 4;
                        struct pci_dev *pdev = tp->pci_dev;

                        /*
                         * The driver does not support incoming fragmented
                         * frames. They are seen as a symptom of over-mtu
                         * sized frames.
                         */
                        if (unlikely(rtl8169_fragmented_frame(status))) {
                                dev->stats.rx_dropped++;
                                dev->stats.rx_length_errors++;
                                rtl8169_mark_to_asic(desc, tp->rx_buf_sz);
                                continue;
                        }

                        rtl8169_rx_csum(skb, desc);

                        if (rtl8169_try_rx_copy(&skb, tp, pkt_size, addr)) {
                                pci_dma_sync_single_for_device(pdev, addr,
                                        pkt_size, PCI_DMA_FROMDEVICE);
                                rtl8169_mark_to_asic(desc, tp->rx_buf_sz);
       ...
        return count;
}

So basically, what fabs did to exploit this was:
1) “spray” the RX buffers with the desired ‘status’ value
2) Create and send a frame with the maximum available RX size to trigger the bug
3) Send a ping to trigger the above interrupt handler
When this happens, there are two possible code paths, both based on the ‘status’ value. The first one is taken if the logical AND of the status with ‘RxRES’, and later with ‘RxFOVF’, returns true. Both of these values are defined in drivers/net/r8169.c like this:

enum rtl_register_content {
       ...
        /* RxStatusDesc */
        RxFOVF  = (1 << 23),
        RxRWT   = (1 << 22),
        RxRES   = (1 << 21),
        RxRUNT  = (1 << 20),
        RxCRC   = (1 << 19),
       ...
};

If the attacker sends a status that triggers this code path, rtl8169_schedule_work() will be invoked to schedule the reset task, returning the device to its previous state with no further errors. On the other hand, if the status doesn’t match this and execution moves to the else clause, it will attempt to receive the frame. On slide 85 of fabs’ presentation we can read:

We’ve built a PoC, which first sprays ‘A’s
and then ‘E’s to stop the device for a
number of frames and then reset it!

Now, let’s move to the else part and examine the code. After retrieving the ‘skbuff’ and the corresponding ‘addr’, it masks the status with ‘0x00001FFF’ and subtracts 4 from it. According to fabs’ tests, this leaves us with the signed integer ‘pkt_size’ equal to -4. After a check for fragmented frames, it will call rtl8169_try_rx_copy() to perform the actual copy.

static int rx_copybreak = 200;
       ...
static inline bool rtl8169_try_rx_copy(struct sk_buff **sk_buff,
                                       struct rtl8169_private *tp, int pkt_size,
                                       dma_addr_t addr)
{
        struct sk_buff *skb;
        bool done = false;

        if (pkt_size >= rx_copybreak)
                goto out;

        skb = netdev_alloc_skb(tp->dev, pkt_size + NET_IP_ALIGN);
        if (!skb)
                goto out;

        pci_dma_sync_single_for_cpu(tp->pci_dev, addr, pkt_size,
                                    PCI_DMA_FROMDEVICE);
        skb_reserve(skb, NET_IP_ALIGN);
        skb_copy_from_linear_data(*sk_buff, skb->data, pkt_size);
        *sk_buff = skb;
        done = true;
out:
        return done;
}

Most people would think that this nice bug ends at the first check since ‘pkt_size’ is now -4; however, as fabs pointed out, ‘rx_copybreak’ is also a signed integer and of course -4 is less than 200, so it passes this check. The next routine is the memory allocation function netdev_alloc_skb(), which takes its second argument as an unsigned integer, so ‘pkt_size + NET_IP_ALIGN’ results in -2, since that constant is defined in include/linux/skbuff.h like this:

/*
 * CPUs often take a performance hit when accessing unaligned memory
 * locations. The actual performance hit varies, it can be small if the
 * hardware handles it or large if we have to take an exception and fix it
 * in software.
 *
 * Since an ethernet header is 14 bytes network drivers often end up with
 * the IP header at an unaligned offset. The IP header can be aligned by
 * shifting the start of the packet by 2 bytes. Drivers should do this
 * with:
 *
 * skb_reserve(skb, NET_IP_ALIGN);
 *
 * The downside to this alignment of the IP header is that the DMA is now
 * unaligned. On some architectures the cost of an unaligned DMA is high
 * and this cost outweighs the gains made by aligning the IP header.
 *
 * Since this trade off varies between architectures, we allow NET_IP_ALIGN
 * to be overridden.
 */
#ifndef NET_IP_ALIGN
#define NET_IP_ALIGN    2
#endif

Once again, it seems really unlikely that you would be able to allocate -2 (which is 4294967294 as an unsigned 32-bit value) bytes, but fabs saw something else here. netdev_alloc_skb() is just a wrapper around __netdev_alloc_skb(), which resides in net/core/skbuff.c, and you can see it here:

/**
 *      __netdev_alloc_skb - allocate an skbuff for rx on a specific device
 *      @dev: network device to receive on
 *      @length: length to allocate
 *      @gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 *      Allocate a new &sk_buff and assign it a usage count of one. The
 *      buffer has unspecified headroom built in. Users should allocate
 *      the headroom they think they need without accounting for the
 *      built in space. The built in space is used for optimisations.
 *
 *      %NULL is returned if there is no free memory.
 */
struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
                unsigned int length, gfp_t gfp_mask)
{
        int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
        struct sk_buff *skb;

        skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
        if (likely(skb)) {
                skb_reserve(skb, NET_SKB_PAD);
                skb->dev = dev;
        }
        return skb;
}
EXPORT_SYMBOL(__netdev_alloc_skb);

It doesn’t pass the length directly to the __alloc_skb() allocation routine; it first adds ‘NET_SKB_PAD’ to it, which is 32, as you can find in include/linux/skbuff.h:

/*
 * The networking layer reserves some headroom in skb data (via
 * dev_alloc_skb). This is used to avoid having to reallocate skb data when
 * the header has to grow. In the default case, if the header has to grow
 * 32 bytes or less we avoid the reallocation.
 *
 * Unfortunately this headroom changes the DMA alignment of the resulting
 * network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive
 * on some architectures. An architecture can override this value,
 * perhaps setting it to a cacheline in size (since that will maintain
 * cacheline alignment of the DMA). It must be a power of 2.
 *
 * Various parts of the networking layer expect at least 32 bytes of
 * headroom, you should not reduce this.
 */
#ifndef NET_SKB_PAD
#define NET_SKB_PAD     32
#endif

This means that the almost 4GB length (-2 as an unsigned value) now becomes -2 + 32, which wraps around to 30!!! At last, only 30 bytes will be allocated, but when execution moves back to rtl8169_try_rx_copy() and skb_copy_from_linear_data() (which is just a wrapper around memcpy()) is called, the kernel will attempt to copy ‘pkt_size’ bytes (which is almost 4GB) from ‘*sk_buff’ to ‘skb->data’ (which is only 30 bytes long), resulting in a nice kernel memory corruption.
Really awesome vulnerability!

Written by xorl

January 2, 2010 at 07:48

Posted in bugs, linux
