By Alan Cao
If you love exploit mitigations, you may have heard of a new system call named mseal landing in the Linux kernel’s 6.10 release, providing a protection called “memory sealing.” Beyond notes from the authors, very little information about this mitigation exists. In this blog post, we’ll explain what this syscall is, including how it’s different from prior memory protection schemes and how it works in the kernel to protect virtual memory. We’ll also describe the particular exploit scenarios that mseal helps stop in Linux userspace, such as malicious permissions tampering and memory unmapping attacks.
What mseal is (and isn’t)
Memory sealing allows developers to make memory regions immutable from illicit modifications during program runtime. When a range of virtual memory areas (VMAs) is sealed, an attacker with a code execution primitive cannot perform subsequent virtual memory operations to change the VMA’s permissions or modify how it is laid out for their benefit.
If you’re like me and followed the spicy discourse surrounding this syscall in the kernel mailing lists, you may have observed that Chrome’s Security team introduced it to support their V8 CFI strategy, initially for Linux-based ChromeOS. After some lengthy deliberation and several rewrites, it finally landed in the kernel, with plans to expand its use case beyond browsers with its integration into glibc, possibly in version 2.41.
mseal’s security guarantees are unlike those of Linux’s memfd_create and its memfd_secret variant, which provide file sealing. memfd_create and memfd_secret allow one to create RAM-backed anonymous files as an alternative to storing content in tmpfs, with memfd_secret taking it a step further by ensuring that the region of memory is accessible only to the process holding the file descriptor. This lets developers create “secure enclave”-style userspace mappings that can guard sensitive in-memory data.
mseal departs from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution rather than potentially local ones looking to exfiltrate sensitive secrets in-memory.
To understand mseal’s security mitigations, we must first study its implementation to understand how it operates. Luckily, mseal is simple to understand, so let’s look at how it works in the kernel!
A look under the hood
mseal has a simple function signature:

    int mseal(unsigned long start, size_t len, unsigned long flags)
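start and len describe the address range we want to seal, which must be backed by valid VMAs: start is the first address and len is the size of the range in bytes. start must be page-aligned, and the kernel rounds len up to a page boundary. flags is unused at the time of writing and must be set to 0.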
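As a rough illustration of these constraints, here is a userspace model of the argument checks. It assumes the behavior described for the syscall: flags must be zero, the start address must be page-aligned, and the length is rounded up to a whole number of pages. The name model_mseal_args is hypothetical, and this is a sketch rather than the kernel’s own code.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace model of mseal's argument rules (assumed, not kernel code).
 * On success, returns 0 and stores the exclusive end address in *end_out. */
static int model_mseal_args(unsigned long start, size_t len_in,
                            unsigned long flags, unsigned long page_size,
                            unsigned long *end_out)
{
    unsigned long len;

    if (flags != 0)
        return -EINVAL;            /* flags are reserved and must be 0 */
    if (start & (page_size - 1))
        return -EINVAL;            /* start must sit on a page boundary */

    /* round the length up to a whole number of pages */
    len = ((unsigned long)len_in + page_size - 1) & ~(page_size - 1);
    if (len_in != 0 && len == 0)
        return -EINVAL;            /* rounding up overflowed */
    if (start + len < start)
        return -EINVAL;            /* range wraps around the address space */

    *end_out = start + len;
    return 0;
}
```

For example, with 4096-byte pages, a request for 100 bytes at 0x1000 covers the whole page and ends at 0x2000, while a start address of 0x1001 is rejected.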
In the 6.12 kernel, its syscall definition calls do_mseal:

    static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
    {
        size_t len;
        int ret = 0;
        unsigned long end;
        struct mm_struct *mm = current->mm; // [1]

        // ... Check flags == 0, check page alignment, and compute `end`

        if (mmap_write_lock_killable(mm)) // [2]
            return -EINTR;

        /*
         * First pass, this helps to avoid
         * partial sealing in case of error in input address range,
         * e.g. ENOMEM error.
         */
        ret = check_mm_seal(start, end); // [3]
        if (ret)
            goto out;

        /*
         * Second pass, this should success, unless there are errors
         * from vma_modify_flags, e.g. merge/split error, or process
         * reaching the max supported VMAs, however, those cases shall
         * be rare.
         */
        ret = apply_mm_seal(start, end); // [4]

    out:
        mmap_write_unlock(current->mm);
        return ret;
    }
do_mseal first computes an end address from the provided length and takes the write lock on the process’s memory map [2] so that no other thread can change the address space layout while sealing is in progress. The global current at [1] represents the currently executing task_struct (i.e., the process invoking mseal). The referenced field is the mm_struct representing the task’s entire virtual address space. The critical data in mm_struct on which this syscall operates is its collection of vm_area_struct values (kept in a maple tree, mm_mt, in recent kernels; older kernels used a linked list named mmap). Each vm_area_struct represents a single contiguous memory region, such as one created by mmap, the stack, or the vDSO.
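To get a feel for VMAs from userspace, a process can list its own through /proc/self/maps, where each line corresponds to one region. The helper below is a small sketch; count_vmas is a hypothetical name, and it assumes procfs is mounted (it returns -1 otherwise).

```c
#include <stdio.h>
#include <string.h>

/* Count the VMAs of the calling process by reading /proc/self/maps.
 * If print is nonzero, each entry is also echoed to stdout.
 * Returns -1 if the file cannot be opened. */
static int count_vmas(int print)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    int count = 0;

    if (maps == NULL)
        return -1;

    while (fgets(line, sizeof(line), maps) != NULL) {
        size_t n = strlen(line);

        if (print)
            fputs(line, stdout);
        /* a very long line arrives in pieces; only the piece that
         * ends in a newline completes one VMA entry */
        if (n > 0 && line[n - 1] == '\n')
            count++;
    }

    fclose(maps);
    return count;
}
```

Calling count_vmas(1) prints entries for the program image, heap, shared libraries, stack, and vDSO, which are the kinds of regions mseal operates on.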
The check_mm_seal call at [3] ensures that the targeted memory map for sealing is a valid range by iterating over each VMA from current->mm to test boundary correctness.
    static int check_mm_seal(unsigned long start, unsigned long end)
    {
        struct vm_area_struct *vma;
        unsigned long nstart = start;

        VMA_ITERATOR(vmi, current->mm, start);

        /* going through each vma to check. */
        for_each_vma_range(vmi, vma, end) {
            if (vma->vm_start > nstart)
                /* unallocated memory found. */
                return -ENOMEM;

            if (vma->vm_end >= end)
                return 0;

            nstart = vma->vm_end;
        }

        return -ENOMEM;
    }
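The same gap-detection logic can be tried out in userspace. The sketch below replaces the kernel’s VMA iterator with a sorted array of address ranges; struct range and model_check_seal are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stddef.h>

struct range {
    unsigned long start;   /* inclusive */
    unsigned long end;     /* exclusive */
};

/* Userspace model of the first pass: returns 0 if [start, end) is fully
 * covered by the given mappings, -ENOMEM if there is a hole.
 * vmas must be sorted by address and must not overlap. */
static int model_check_seal(const struct range *vmas, size_t count,
                            unsigned long start, unsigned long end)
{
    unsigned long nstart = start;
    size_t i;

    for (i = 0; i < count; i++) {
        if (vmas[i].end <= start)
            continue;              /* entirely below the requested range */
        if (vmas[i].start >= end)
            break;                 /* entirely above the requested range */
        if (vmas[i].start > nstart)
            return -ENOMEM;        /* unmapped gap before this mapping */
        if (vmas[i].end >= end)
            return 0;              /* requested range is fully covered */
        nstart = vmas[i].end;
    }

    return -ENOMEM;                /* ran out of mappings before reaching end */
}
```

With mappings at 0x1000-0x3000, 0x3000-0x5000, and 0x6000-0x8000, a request for 0x2000-0x4000 succeeds because the first two mappings are contiguous, while 0x4000-0x7000 fails because of the hole at 0x5000-0x6000.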
The magic happens in the apply_mm_seal call [4], which walks over each VMA again and arranges for the targeted region to have an additional VM_SEALED flag through the mseal_fixup call:
    static int apply_mm_seal(unsigned long start, unsigned long end)
    {
        // ...
        nstart = start;
        for_each_vma_range(vmi, vma, end) {
            int error;
            unsigned long tmp;
            vm_flags_t newflags;

            newflags = vma->vm_flags | VM_SEALED;
            tmp = vma->vm_end;
            if (tmp > end)
                tmp = end;
            error = mseal_fixup(vmi, vma, &prev, nstart, tmp, newflags);
            if (error)
                return error;
            nstart = vma_iter_end(&vmi);
        }

        return 0;
    }
To ensure that unwanted memory operations respect this new flag, the mseal patchset adds VM_SEALED checks to the following files:
mm/madvise.c | 12 +
mm/mmap.c | 31 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 307 ++++
For instance, mprotect and pkey_mprotect enforce this check when they eventually invoke mprotect_fixup:
    int mprotect_fixup(..., struct vm_area_struct *vma, ...)
    {
        // ...
        if (!can_modify_vma(vma))
            return -EPERM;
        // ...
    }
To determine whether the syscall should continue, can_modify_vma (defined in mm/vma.h) will test for the existence of VM_SEALED in the specified vm_area_struct:
    static inline bool vma_is_sealed(struct vm_area_struct *vma)
    {
        return (vma->vm_flags & VM_SEALED);
    }

    /*
     * check if a vma is sealed for modification.
     * return true, if modification is allowed.
     */
    static inline bool can_modify_vma(struct vm_area_struct *vma)
    {
        if (unlikely(vma_is_sealed(vma)))
            return false;

        return true;
    }
From the changes in other memory-management syscalls, we can determine the operations that are not permitted on a VMA after it is sealed:
- Changing permission bits with mprotect and pkey_mprotect
- Unmapping with munmap
- Replacing a sealed mapping with another one that is mutable/unsealed via mmap(MAP_FIXED)
- Expanding or shrinking its size with mremap. Shrinking to zero could create a refillable hole for a new mapping with no sealing, as it triggers an unmap altogether.
- Migrating to a new destination with mremap(MREMAP_MAYMOVE | MREMAP_FIXED). Note that sealing checks are imposed on both the source and destination VMAs. Also, the source VMA will be unmapped if MREMAP_DONTUNMAP is not supplied, but the munmap sealing check will still apply.
- Calling madvise with certain destructive flags
For now, one can invoke mseal on a 6.10+ kernel through a direct syscall invocation. Here’s a basic wrapper implementation to help you get started:
    #include <stddef.h>      /* size_t */
    #include <stdint.h>      /* uintptr_t */
    #include <sys/syscall.h>
    #include <unistd.h>      /* syscall, getpagesize */

    #define MSEAL_SYSCALL 462

    long mseal(unsigned long start, size_t len)
    {
        int page_size;
        uintptr_t page_aligned_start;

        /* how large a page should be on our system (default: 4096 bytes) */
        page_size = getpagesize();

        /* page align the VMA range we want to seal */
        page_aligned_start = start & ~(page_size - 1);

        /* flags must be 0; pass it as unsigned long to match the syscall ABI */
        return syscall(MSEAL_SYSCALL, page_aligned_start, len, 0UL);
    }
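To see sealing in action, and to cope with kernels that lack the syscall, a program can probe for it at runtime. The following sketch maps a scratch page, seals it, and confirms that mprotect and munmap are then refused with EPERM. probe_sealing is a hypothetical helper name, and the sketch treats any mseal failure (for example ENOSYS on an older kernel or inside a sandbox) as “unavailable.”

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MSEAL_SYSCALL 462  /* same syscall number the wrapper above uses */

/*
 * Returns  1 if sealing worked and blocked both mprotect and munmap,
 *          0 if mseal is unavailable in this environment,
 *         -1 on any unexpected result.
 */
static int probe_sealing(void)
{
    size_t page = (size_t)getpagesize();
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    if (syscall(MSEAL_SYSCALL, p, page, 0UL) != 0) {
        munmap(p, page);           /* not sealed, so unmapping still works */
        return 0;
    }

    /* from here on the page is sealed and stays mapped until exit */
    if (mprotect(p, page, PROT_READ) != -1 || errno != EPERM)
        return -1;                 /* permission change should be refused */
    if (munmap(p, page) != -1 || errno != EPERM)
        return -1;                 /* unmapping should be refused */

    return 1;
}
```

A caller can use the result to decide whether to rely on sealing or to fall back to its existing hardening. Note that a sealed page cannot be released again, so a probe like this should run at most once per process.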
What exploit techniques does mseal help mitigate?
From the disallowed operations, we can discern two particular exploit scenarios that memory sealing will prevent:
- Tampering with a VMA’s permissions. Notably, not allowing executable permissions to be set can stop the revival of shellcode-based attacks.
- “Hole-punching” through arbitrary unmapping/remapping of a memory region, mitigating data-only exploits that take advantage of refilling memory regions with attacker-controlled data.
Let’s examine these scenarios in more detail, and the defense-in-depth strategies developers can employ in their software implementations.
Hardening NX
Even with the continued existence of code reuse techniques like ROP, attackers may prefer to gain shellcoding capability during exploitation; this can provide a stable and “easy win,” especially if constraints are imposed on the gadget chain. Here is a potential workflow to achieve this:
- Through some target functionality, spray shellcode onto a non-executable stack/heap region.
- Exploit the target’s bug to kick off an initial ROP chain to call mprotect with PROT_EXEC to target the region holding the shellcode and turn off the NX bit.
- Jump to it to revive old-school shellcoding!
The exploit for CVE-2018-7445 targeting Mikrotik RouterOS’s SMB daemon is a notable example. A socket-based shellcode is sprayed onto the non-executable heap, and the crafted ROP chain from a stack overflow modifies heap memory permissions before executing shellcode.
The most straightforward use case for memory sealing is disallowing VMA permission modification; once a region is sealed, exploits that rely on traditional shellcode won’t be able to switch on its executable bit.
As mentioned, mseal will be introduced in glibc 2.41+, where the dynamic loader will apply sealing across a predetermined set of VMAs. However, at the time of writing, this will not be done automatically for the stack or heap.
This is expected because these regions can expand during runtime. For instance, a heap allocator that wants to reclaim space will invoke the brk syscall, which could call arch_unmap and eventually do_vmi_unmap to perform shrinking. Of course, this would be disallowed under sealing and thus break dynamic memory allocation for the application altogether.
So, for now, the software developer is responsible for protecting these regions, as they have the context to determine when and where sealing should be applied appropriately.
Let’s use mseal to enhance the stack’s old-school NX (non-executable) protection. Here’s a simple example that emulates the scenario mentioned above:
    #include <stdint.h>     /* int64_t */
    #include <sys/mman.h>   /* mprotect, PROT_* */
    #include <unistd.h>     /* getpagesize */

    int main(void)
    {
        /* represents the stack that now contains /bin/sh shellcode we somehow sprayed */
        unsigned char exec_shellcode[] =
            "\xe1\x45\x8c\xd2\x21\xcd\xad\xf2\xe1\x65\xce\xf2\x01\x0d\xe0\xf2"
            "\xe1\x8f\x1f\xf8\xe1\x03\x1f\xaa\xe2\x03\x1f\xaa\xe0\x63\x21\x8b"
            "\xa8\x1b\x80\xd2\xe1\x66\x02\xd4";

        // vulnerability triggered, hijacked instruction pointer

        /* ======= what our ROP chain would do: ======= */

        /* compute the start of the page for the shellcode */
        void (*exec_ptr)() = (void(*)())&exec_shellcode;
        void *exec_offset = (void *)((int64_t) exec_ptr & ~(getpagesize() - 1));

        mprotect(exec_offset, getpagesize(), PROT_READ|PROT_WRITE|PROT_EXEC);

        /* this now works! */
        exec_ptr();
        return 0;
    }
As we’d expect, setting PROT_EXEC on the VMA permits exec_shellcode to become executable again:
~ gcc stack_no_sealing.c -o stack_no_sealing
~ ./stack_no_sealing
$
Let’s introduce memory sealing on the stack-based exec_offset VMA range:
    /* mseal() is the wrapper shown earlier; handle_error() is assumed to
     * report the failed call and exit */
    int main(void)
    {
        /* represents the stack that now contains /bin/sh shellcode we somehow sprayed */
        unsigned char exec_shellcode[] =
            "\xe1\x45\x8c\xd2\x21\xcd\xad\xf2\xe1\x65\xce\xf2\x01\x0d\xe0\xf2"
            "\xe1\x8f\x1f\xf8\xe1\x03\x1f\xaa\xe2\x03\x1f\xaa\xe0\x63\x21\x8b"
            "\xa8\x1b\x80\xd2\xe1\x66\x02\xd4";

        /* compute the start of the page for the shellcode */
        void (*exec_ptr)() = (void(*)())&exec_shellcode;
        void *exec_offset = (void *)((int64_t) exec_ptr & ~(getpagesize() - 1));

        /* seal the stack page containing the shellcode! */
        if (mseal((unsigned long) exec_offset, getpagesize()) < 0)
            handle_error("mseal");

        // vulnerability triggered, hijacked instruction pointer

        /* ======= what our ROP chain would do: ======= */
        mprotect(exec_offset, getpagesize(), PROT_READ|PROT_WRITE|PROT_EXEC);

        /* segfault now, as no permission change actually occurred */
        exec_ptr();
        return 0;
    }
The aforementioned can_modify_vma check kicks in when mprotect is called, preventing the permission change from ever happening, and the attempt to shellcode now fails:
~ gcc stack_with_sealing.c -o stack_with_sealing
~ ./stack_with_sealing
[1] 48771 segmentation fault (core dumped) ./stack_with_sealing
A simple strategy to accommodate real-world software could involve sparingly introducing a macro-ized version of the mseal code snippet and iteratively sealing pages in select stack frames where untrusted data could reside for exploitation:
    #define SIMPLE_HARDEN_NX_SINGLE_PAGE(frame) \
        do { \
            void *frame_offset = (void *)((int64_t) &frame & ~(getpagesize() - 1)); \
            if (mseal((unsigned long) frame_offset, getpagesize()) == -1) { \
                handle_error("mseal"); \
            } \
        } while(0)

    int frame_2(void)
    {
        int frame_start = 0;
        unsigned char another_untrusted_buffer[1024] = { 0 };

        SIMPLE_HARDEN_NX_SINGLE_PAGE(frame_start);
        return 0;
    }

    int frame_1(void)
    {
        unsigned char untrusted_buffer[1024] = { 0 };

        SIMPLE_HARDEN_NX_SINGLE_PAGE(untrusted_buffer);
        return frame_2();
    }
Even if a sealed VMA is reused as a frame for another function with sealing logic, invoking mseal again would be considered a no-op, so no errors would emerge. Of course, developers should be mindful of edge cases like automatic stack expansion from aggressive usage or bespoke features like stack splitting.
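One further edge case is that a stack buffer is not guaranteed to fit inside a single page: a 1024-byte array that starts near the end of a page spills into the next one. The sketch below computes the page-aligned range that covers an entire buffer, which could then be passed to mseal. covering_pages is a hypothetical helper, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the page-aligned start and length of the smallest run of pages
 * that contains every byte of buf[0 .. size-1].
 * page_size must be a power of two. */
static void covering_pages(const void *buf, size_t size, size_t page_size,
                           uintptr_t *start_out, size_t *len_out)
{
    uintptr_t mask = ~((uintptr_t)page_size - 1);
    uintptr_t first = (uintptr_t)buf & mask;
    /* address of the last byte of the buffer (or buf itself if size is 0) */
    uintptr_t last_byte = (uintptr_t)buf + (size ? size - 1 : 0);
    uintptr_t last = last_byte & mask;

    *start_out = first;
    *len_out = (size_t)(last - first) + page_size;
}
```

For a 1024-byte buffer at 0x7ffc0f80 with 4096-byte pages, this yields a start of 0x7ffc0000 and a length of 0x2000, i.e., two pages rather than one. Sealing more of the stack also raises the odds of colliding with the expansion cases mentioned above, so this trade-off is best decided per call site.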
Hopefully, as the integration of mseal into glibc continues, we’ll see tunables emerge that do not require any manual use of the syscall for the stack. Commenters on LWN have asked for automatic sealing that can be toggled on for simpler applications.
And with all this said, if an attacker doesn’t want to fully ROP and insists on bringing back shellcode nostalgia, they could always use their initial code reuse technique to mmap a fresh region that is executable. However, this is pretty laborious, as it now involves copying the exploit payload from a readable region to this new mapping.
Mitigating unmapping-based, data-only exploitation
Disallowing mprotect also prevents a sealed region from becoming writable, which is valuable if there are data variables that, when modified, could enhance an exploit primitive. However, during the inception of mseal, Chrome maintainers identified an easier and more powerful technique with the added benefit of circumventing CFI (control-flow integrity). They determined that if an attacker can pass a corrupted pointer to unmapping/remapping syscalls, they can “punch a hole” in memory that could be refilled with attacker-controlled data. This would not violate CFI guarantees, as forward- and backward-edge CFI would cover only tampered control-flow transitions (e.g., stack return addresses and function pointers).
This is incredibly enticing for a browser implementing a JIT compiler. V8’s Turbofan can create regions that switch between RW and RX, aiding the refill process and changing permissions. Thus, an attacker can take advantage of the JIT compilation process by emitting executable code from hot-path JavaScript into the unmapped region to overwrite critical data and then leverage modifications to yield code execution.
We argue this is a data-only exploitation technique, as it doesn’t involve directly hijacking control flow or requiring leaked pointers but rather tampering with particular data in memory that influences control flow to the attacker’s liking. In an era of mitigations like CFI, this has emerged as a pretty potent technique during exploitation. Thus, memory sealing can prevent these particular data-only techniques by disallowing hole-punching scenarios.
This particular data-only technique isn’t just for browsers with JIT compilers! A similar technique would be the House of Muney for userspace heap exploitation. As Max Dulin points out in his post, Qualys used this technique to perform a real-world exploit for an ancient bug in Qmail.
This technique relies on the fact that for huge allocated chunks (greater than the M_MMAP_THRESHOLD tunable), malloc and free will directly invoke mmap and munmap, respectively, with no intermediate freelists that cache any freed chunks (which helps greatly simplify exploitation). Since size metadata exists at the top of allocated chunks, tampering with it to set a different page-aligned size and then freeing the chunk causes a munmap that also covers memory regions adjacent to the chunk. Dulin used the arbitrary munmap to target the .gnu.hash and .dynsym regions and, after refilling them with another larger mmap chunk, enabled the overwriting of a single, yet-to-be-resolved PLT entry, reviving a GOT overwrite-style attack!
Dulin has a very well-done and annotated PoC for this attack here. Here’s an abridged version that goes up to the point where the unmapping and refill occur:
    #include <stdlib.h>   /* malloc, free */
    #include <string.h>   /* memset */

    // With this allocation size,
    // malloc is now equivalent to mmap
    // free is now equivalent to munmap
    #define THRESHOLD_SIZE 0x100000

    int main()
    {
        long long *bottom, *top, *refill;

        bottom = malloc(THRESHOLD_SIZE);
        memset(bottom, 'B', THRESHOLD_SIZE);

        // [1] Allocation that we write into out-of-bounds from a prior chunk
        top = malloc(THRESHOLD_SIZE);
        memset(top, 'A', THRESHOLD_SIZE);

        // [2] Corrupts size field, ensuring page alignment + mmap bit is set
        // size to unmap = top + bottom + large arbitrary size
        int unmap_size = (0xfffffffd & top[-1]) + (0xfffffffd & bottom[-1]) + 0x14000;
        top[-1] = (unmap_size | 2);

        // Trigger munmap with corrupted chunk
        free(top);

        // [3] Refill with new and larger mmap chunk
        refill = malloc(0x5F0000);
        memset(refill, 'X', 0x5F0000);
        return 0;
    }
By the time we finish [1], we can see that the top and bottom chunks now exist in a separate mapping below the heap, separated by 4096-byte padding. Note the adjacent libc mapping at 0xfffff7df0000.
At [2], we corrupt the size field of the chunk to a much larger page-aligned size and ensure that the mmap bit is set. When we break on the munmap occurring in the free [3], the size argument passed has been changed, allowing an unmap into the adjacent region!
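The arithmetic behind the corrupted size field can be worked through in isolation. The sketch below mirrors the PoC’s calculation: strip the mmap flag bit (0x2) from each chunk’s size field, add the sizes plus an extra page-aligned amount, and set the flag again. forged_chunk_size is a hypothetical helper, and the sample value 0x101002 is an assumed size field for a 0x100000-byte request on 64-bit glibc (request plus header, rounded up to a page, with the mmap bit set). Unlike the PoC, which masks with the 32-bit constant 0xfffffffd, this version clears only the flag bit.

```c
#include <stddef.h>

#define IS_MMAPPED 0x2UL   /* flag bit in a glibc chunk's size field */

/* Build the forged size field: combined size of both chunks plus an
 * extra page-aligned amount, with the mmap flag bit set. */
static size_t forged_chunk_size(size_t top_size_field,
                                size_t bottom_size_field,
                                size_t extra)
{
    size_t total = (top_size_field & ~IS_MMAPPED)
                 + (bottom_size_field & ~IS_MMAPPED)
                 + extra;

    return total | IS_MMAPPED;
}
```

With two chunks whose size fields are 0x101002 and an extra 0x14000 bytes, the forged field is 0x216002: free then asks munmap to release 0x216000 bytes, far more than the 0x101000 bytes the top chunk actually occupies.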
After [3], this can be confirmed by examining the contents of the previous libc mapping at 0xfffff7df0000, now partially overwritten with Xs.
This is a pretty nifty data-only technique that can operate even in the presence of CFI and does not require a prerequisite ASLR leak!
Luckily, the aforementioned set of VMAs in mseal’s glibc integration is expected to automatically mitigate this without any developer intervention, as mapped binary code and dynamic libraries become sealed from any remap/unmapping tricks like this. For additional hardening, a developer can selectively seal mmap allocations that they know will never expand or become unmapped during the lifetime of their program. This will have the added benefit of preventing the previous exploit scenario if attacker-controlled data can be expected to be written into the mmap chunks and may become writable/executable.
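As a sketch of that selective approach, the helper below creates a long-lived, read-only copy of some data in its own anonymous mapping and then seals it, so that later mprotect, munmap, or mremap calls on it are refused. alloc_sealed_readonly is a hypothetical name; sealing is treated as best effort, so on kernels without mseal the mapping is still read-only, just not sealed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MSEAL_SYSCALL 462

/* Copy len bytes from src into a fresh anonymous mapping, make it
 * read-only, and try to seal it. Returns NULL on failure.
 * A sealed mapping can never be freed, so use this only for data that
 * lives for the whole lifetime of the process. */
static void *alloc_sealed_readonly(const void *src, size_t len)
{
    size_t page = (size_t)getpagesize();
    size_t map_len = (len + page - 1) & ~(page - 1);
    void *p;

    if (len == 0 || map_len < len)
        return NULL;               /* empty request or rounding overflow */

    p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memcpy(p, src, len);

    if (mprotect(p, map_len, PROT_READ) != 0) {
        munmap(p, map_len);
        return NULL;
    }

    /* best effort: on failure the mapping stays read-only but unsealed */
    (void)syscall(MSEAL_SYSCALL, p, map_len, 0UL);
    return p;
}
```

This pattern suits data such as configuration tables or dispatch metadata that is written once at startup. It is not appropriate for buffers that an allocator may later resize or release.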
Build stronger software with mseal
There are likely many other use cases and scenarios that we didn’t cover. After all, mseal is the newest kid on the block in the Linux kernel! As the glibc integration completes and matures, we expect to see improved iterations for the syscall to meet particular demands, including fleshing out the ultimate use of the flags parameter.
Hardening software is complex: weighing the risk and reward of new security mitigations can be challenging. If this blog post is interesting to you, check out some of our escapades into other security mitigations. If you’re seeking guidance in integrating mseal or any other modern mitigations into your software, contact us!