[doc][netebpfext] Add proposal for async operations in netebpfext #5189
matthewige wants to merge 23 commits into
Pull request overview
Adds a design proposal document for introducing pend/complete-style asynchronous verdict processing to netebpfext (eBPF network extension), intended to enable external orchestrators (user-mode and/or kernel-mode) to complete deferred WFP decisions.
Changes:
- Introduces a new proposal doc describing an async "PEND/COMPLETE/CONTINUE" lifecycle built around a custom map and a `net_ebpf_ext_pend_operation()` helper.
- Documents expected control flow for PEND, COMPLETE, CONTINUE, and common failure/edge cases across multiple WFP layers.
- Outlines required `ebpfcore` platform changes and a phased `netebpfext` work breakdown.
| [Supported WFP layers](#supported-wfp-layers)) must be able to | ||
| pend the current network operation and return control to WFP. | ||
| 2. An external async orchestrator must be able to complete the pended operation | ||
| with a verdict (PERMIT, BLOCK, or CONTINUE) at a later time. |
Any constraints on "later"? How do you prevent a buggy ebpf program from pending and never completing, thus consuming all resources? Is there something the verifier can do to ensure safety?
Good question. To be clear, there is no 100% guarantee here -- bounding completion latency is the async orchestrator's responsibility, by design. The extension provides backstops, not a guarantee:
- The pend map is bounded by `max_entries`. Once full, `pend()` returns an error and the program must return a non-PEND verdict, so a buggy/stalled orchestrator degrades gracefully (no kernel resource exhaustion).
- A per-entry stale-entry watchdog (driven by the timestamp recorded at pend time) reclaims entries whose orchestrator never came back.
- There is no verifier check that proves `COMPLETE` will eventually be called, since that depends entirely on user-mode behavior; the verifier can't reason about it.
Added a forward-link from the requirement to Edge case 1 in 4228b62 so the design overview points readers at these backstops. Happy to add a more explicit verifier-side check if you have one in mind (e.g., 'program must call pend() before returning PEND').
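For illustration only, the two backstops described above (a bounded pend map plus a timestamp-driven stale-entry pass) can be sketched in plain user-space C. Everything here is an assumption made for the sketch: the names (`pend_table_t`, `pend_table_add`, `pend_table_reap_stale`), the capacity of 4, and the 2-second timeout are hypothetical and are not taken from the proposal. Locking and the real custom-map plumbing are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values: stand-ins for the map's max_entries and a watchdog timeout. */
#define PEND_MAX_ENTRIES 4u
#define PEND_STALE_AFTER_MS 2000u

typedef struct {
    bool in_use;
    uint64_t pend_id;
    uint64_t pended_at_ms; /* timestamp recorded at pend time */
} pend_entry_t;

typedef struct {
    pend_entry_t entries[PEND_MAX_ENTRIES];
} pend_table_t;

static void pend_table_init(pend_table_t* table)
{
    assert(table != NULL);
    memset(table, 0, sizeof(*table));
}

/* PEND: returns 0 on success, -1 when the table is full.
 * On -1 the caller must return a non-PEND verdict. */
static int pend_table_add(pend_table_t* table, uint64_t pend_id, uint64_t now_ms)
{
    for (uint32_t i = 0; i < PEND_MAX_ENTRIES; i++) {
        pend_entry_t* entry = &table->entries[i];
        if (!entry->in_use) {
            entry->in_use = true;
            entry->pend_id = pend_id;
            entry->pended_at_ms = now_ms;
            return 0;
        }
    }
    return -1;
}

/* COMPLETE: returns 0 if the entry was found and released,
 * -1 if it is absent (already completed or already reclaimed). */
static int pend_table_complete(pend_table_t* table, uint64_t pend_id)
{
    for (uint32_t i = 0; i < PEND_MAX_ENTRIES; i++) {
        pend_entry_t* entry = &table->entries[i];
        if (entry->in_use && entry->pend_id == pend_id) {
            entry->in_use = false;
            return 0;
        }
    }
    return -1;
}

/* Watchdog pass: reclaim every entry at least PEND_STALE_AFTER_MS old.
 * Assumes now_ms is not earlier than any recorded timestamp.
 * Returns the number of entries reclaimed. */
static uint32_t pend_table_reap_stale(pend_table_t* table, uint64_t now_ms)
{
    uint32_t reclaimed = 0;
    for (uint32_t i = 0; i < PEND_MAX_ENTRIES; i++) {
        pend_entry_t* entry = &table->entries[i];
        if (entry->in_use && now_ms - entry->pended_at_ms >= PEND_STALE_AFTER_MS) {
            entry->in_use = false;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

In this sketch a late COMPLETE for a reclaimed entry simply fails with -1, which is one way to read the "fail and clean up state if not completed in time" guarantee raised in the review.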
If there is no guarantee, I'm afraid this will degrade the value of ebpf itself. Would like to discuss this in a meeting. I didn't quite understand the point "A per-entry stale-entry watchdog (driven by the timestamp recorded at pend time) reclaims entries whose orchestrator never came back". If this is just saying that it's guaranteed to complete within (say) 2 seconds, by failing (and cleaning up state) if not completed, then that would be a type of guarantee of safety.
- Fix TOC anchor for AUTH_CONNECT/AUTH_RECV_ACCEPT (slash maps to double dash)
- Remove FwpsPendClassify reference in DATAGRAM_DATA section -- per the layer table, datagram uses ABSORB+reinject only, no WFP pend API
- Fix typos (is is, the the)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add forward link from the COMPLETE requirement to Edge case 1, and call out that bounding completion latency is the orchestrator's responsibility -- the extension only provides backstops (stale-entry watchdog, bounded max_entries). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note that re-invocation is asynchronous on a worker thread at PASSIVE_LEVEL, not synchronous inside COMPLETE, and not pinned to the original classifyFn CPU. Adds a forward-link to the CONTINUE flow section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- PERMIT -> PERMIT_SOFT/PERMIT_HARD/BLOCK across COMPLETE step body, callback validation list, and sequence-diagram labels.
- Failure flow: REJECT only (remove contradictory PERMIT fallback).
- Trim redundancy in helper description, 'Why threaded DPC?' blockquote, and CPU hot-unplug fallback (cross-ref canonical sources).
- Add diagnostic trace note in 'Multiple attached programs and PEND'.
- Service-crash text simplified to brief 'no active cleanup' wording.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address PR review feedback: the raw WSACMSGHDR* pointer would not be valid after the original classifyFn returns. Add a comment specifying that control_data is deep-copied into extension-owned memory at pend time so it is safe to access during completion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
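As a rough illustration of the deep-copy rule in this commit, the sketch below copies caller-owned control data into memory owned by the saved pend state, so that later access does not depend on the caller's buffer. It is a user-space stand-in, not the extension's code: `pend_saved_state_t`, `pend_save_control_data`, and `pend_release_control_data` are hypothetical names, the buffer is treated as opaque bytes rather than a real `WSACMSGHDR`, and `malloc`/`free` stand in for whatever kernel allocator the extension would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical saved state: holds an extension-owned copy of the control data. */
typedef struct {
    uint8_t* control_data;
    size_t control_data_length;
} pend_saved_state_t;

/* Deep-copy `length` bytes from `source` at pend time.
 * Returns 0 on success (including the "nothing to copy" case), -1 on allocation failure. */
static int pend_save_control_data(pend_saved_state_t* state, const void* source, size_t length)
{
    assert(state != NULL);
    state->control_data = NULL;
    state->control_data_length = 0;
    if (source == NULL || length == 0) {
        return 0;
    }
    uint8_t* copy = (uint8_t*)malloc(length);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, source, length);
    state->control_data = copy;
    state->control_data_length = length;
    return 0;
}

/* Release the copy when the pend entry is completed or reclaimed. */
static void pend_release_control_data(pend_saved_state_t* state)
{
    assert(state != NULL);
    free(state->control_data);
    state->control_data = NULL;
    state->control_data_length = 0;
}
```

The point of the pattern is that overwriting or freeing the original buffer after the pend call leaves the saved copy intact.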
- Add extension-side stale-entry watchdog (layer (e)) as kernel-owned backstop; cite netebpfext connect-redirect precedent.
- Tighten saved-state-for-CONTINUE: extension saves full helper-visible net_ebpf_sock_addr_t wrapper (hook_id, redirect_context, transport_endpoint_handle, process_id, access_information, original_context).
- Add keying-note callout for continuation map (CONTINUE is consumer-defined; conn-tuple holds for ALE; per-packet layer keying deferred to per-layer design).
- Soften 'cleanup orchestrator responsibility' to layered model (orchestrator primary; extension backstop).
- Threaded-DPC: remove duplicated 'Load-bearing assumption' callout; reframe as 'Implementation validation' (engineering validation of reimplementation, not uncertainty about underlying DDK pattern).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the best-effort KEVENT signal-at-end fence for layers where WFP delivers classifyFn at PASSIVE_LEVEL (TCP ALE_AUTH_CONNECT/ALE_AUTH_RECV_ACCEPT). The threaded-DPC ordering proof relies on the IRQL barrier and does not transfer to PASSIVE; the fence narrows but does not close the race window with WFP's post-classify pipeline cleanup. Flagged as an open dependency for WFP-team confirmation.
- New sub-section under Race B
- Per-layer callout under AUTH_CONNECT / AUTH_RECV_ACCEPT
- New netebpfext work-breakdown item for the fence implementation
- Fix broken Race B anchor (missing third hyphen in slug for ' -> ')
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extension-driven periodic stale-entry purge (Edge case 1 back-stop) requires a kernel-mode handle to the custom map plus an enumeration entry point. Today the dispatch-table callbacks deliver per-key context only -- the provider has no map handle and no way to iterate independently of an inbound op. Added as ebpfcore platform requirement #5.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ebpfcore platform requirements item 2 now includes the map-handle requirement and the extension-driven periodic stale-entry purge (folded in from a brief separate item 5 that was added then removed).
- Drop empty '## WFP implementation requirements' section (was a one-line filler heading missing from the TOC; per-layer requirements live under '## Per-layer async design').
- Helper-name typo: pend_operation() -> bpf_pend_operation() in the implicit-program-context-accessor narrative (matches every other reference to the helper in this doc).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the WESP-side change: add originating_program to net_ebpf_ext_pend_internal_state_t so netebpfext can identify which program in a chain returned PEND when re-dispatching on CONTINUE. This was a real gap when the pend map is shared across multiple programs attached at the same attach point. Layer + attach params were already captured (layer_id, compartment_id) and chain semantics were already pinned down (PEND short-circuits at program N, aggregate_verdict from 1..N-1 preserved, only one program may PEND). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the WESP-side correction. The doc had implied that user-mode CRUD support was something that needed to be added -- in fact, user-mode and BPF-program-driven CRUD via the existing dispatch-table callbacks (preprocess_map_update_element, preprocess_map_delete_element, postprocess_map_find_element, postprocess_map_delete) is fully wired and works today. The actual remaining ebpfcore gap is extension-initiated kernel-mode CRUD APIs that let a custom-map provider drive operations on its own map from kernel mode (from inside an extension helper handler such as pend(), and from the threaded DPC where there is no in-flight user-mode or BPF caller to drive the dispatch). Updated the design-overview note and ebpfcore work-breakdown item 2 to reflect this -- the user-mode/BPF-helper paths are explicitly called out as 'no new wiring needed', and only the kernel-mode CRUD path is flagged as new ebpfcore work. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror updates from the WESP async-processing PR round 4 walkthrough:
- Synthesized COMPLETE: clarified how netebpfext identifies which entry to clean up via pend_id stored in private classifyFn context wrapper.
- Multiple attached programs: documented detection of competing PEND-capable programs via a new MAY_PEND attach-params flag plus client-side enumeration of bpf_link_info.attach_data.
- Smaller threads: padding/static_assert, netebpfext fail behavior.
netebpfext work breakdown updated to track new attach-params extension item.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…odern standby + generic)
New Edge case 4 documents two mitigation patterns for conditions that require the orchestrator to temporarily disallow new pends and force-drain already-pended operations (modern standby is the canonical example; in-place servicing and planned maintenance are others):
- Pattern (a): pend control custom map (BPF_MAP_TYPE_NET_EBPF_EXT_PEND_CONTROL, size 1, singleton per extension instance, opt-in). Map is zero-initialized to PEND_STATE_DISABLED so the default is fail-closed; orchestrator affirms readiness with an explicit PEND_STATE_ENABLED write at startup. Helper short-circuits to NET_EBPF_EXT_PEND_ERROR_DISABLED while disabled; the ENABLED -> DISABLED transition queues a kernel-side threaded DPC drain.
- Pattern (b): best-effort shared BPF_MAP_TYPE_ARRAY polling fallback. Race window for new pends; no kernel-side fail-safe drain; cheaper.
Neutral trade-off framing: (a) actually closes the 9F surface, (b) significantly lower cost but leaves the problem unresolved; (a) can be layered on top of (b) incrementally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…CEPT to DISPATCH; narrow PASSIVE fence to AUTH_LISTEN Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…URITY_SUBJECT_CONTEXT)
Documents the design for identity-aware programs that need a full PACCESS_TOKEN or captured SECURITY_SUBJECT_CONTEXT (rather than just the DISPATCH-safe TOKEN_ACCESS_INFORMATION blob from FWPS_INCOMING_VALUE_ALE_USER_ID).
Key points:
- DISPATCH-vs-PASSIVE applicability per WFP layer (token resolution via ObReferenceObjectByHandle / PsLookupProcessByProcessId is PASSIVE-only; AUTH_CONNECT and AUTH_RECV_ACCEPT are DISPATCH).
- New extension-specific helpers bpf_get_access_token / bpf_get_subject_context that return NULL at DISPATCH when not pre-resolved.
- New BPF_*_VERDICT_DEFER_TO_PASSIVE verdict + extension-driven PEND + threaded-DPC pinned to classifyFn CPU + worker re-invocation via the existing CONTINUE path.
- PID-reuse detection via token-pointer equality (PsReferencePrimaryToken vs ObReferenceObjectByHandle of saved HANDLE) before SeCaptureSubjectContextEx is called.
- PETHREAD limitation (Thread = NULL; matches WFP ALE access-check semantics).
- Lifecycle / cleanup table; double-defer and PASSIVE-layer-defer fail-closed guards.
Tracking issues: microsoft#5231 (parent), microsoft#5235 (bpf_get_access_token), microsoft#5236 (bpf_get_subject_context).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace two-helper design (bpf_get_access_token + bpf_get_subject_context)
with single bpf_get_identity() returning bpf_identity_info_t. Errno set
{0, -EAGAIN, -ENOENT, -EINVAL}.
- Classify-wrapper steps 1-11 cover PEND + DEFER_TO_PASSIVE re-invocation,
PID-reuse + token-pointer-equality check, atomic-snapshot policy.
- Add No-leak invariants (pend-entry path) section documenting structured
acquire/release pairing across all failure paths.
- Add Race windows R1-R10 with per-row mitigations.
- Add DATAGRAM_DATA / STREAM identity propagation: per-flow blob with
independent refs published via FwpsFlowAssociateContext0 + flowDeleteFn;
F1-F6 race table; open implementation questions.
- Add netebpfext work-breakdown item 10 (identity-aware programs).
- Reference single GH issue microsoft#5235 (supersedes microsoft#5236).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore subject clause in 'No-leak invariants (pend-entry path)' intro that lost its leading sentence in a prior edit.
- Drop filler closing sentence in Custom-map subsection.
- Drop 'Race protection: COMPLETE before pend API' callout in PEND flow; full argument lives in Race A and pend-API ordering note cross-references it.
- Drop CONTINUE re-invocation Note that restated steps 3 and 5 of the same flow verbatim.
- Drop section-preview sentence in Identity-aware programs intro (Design overview enumerates the same content one section below).
- Drop heading-restating opener sentences in Per-layer async design and Async orchestrator integration guide section intros.
- Tighten orchestrator-guide 'COMPLETE and cleanup' subsection to cross-references; full stale-entry cleanup design is in Edge case 1.
No technical detail removed; all design rationale, race tables, ownership/lifetime invariants, and per-layer specifics are preserved.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the corresponding refinements from the internal WESP design doc:
- Add Anti-loop guards callout in 'New verdict: defer to PASSIVE' (paired helper/verdict guards prevent runaway DISPATCH<->PASSIVE loops).
- Correct the MAY_PEND wording: the user-mode loader sets the existing MAY_PEND attach flag at bpf_link_create time for programs that use bpf_get_identity(); it is not auto-stamped from bytecode. Note that netebpfext treats DEFER_TO_PASSIVE as a chain-terminating verdict identical to PEND.
- Compress the DATAGRAM_DATA / STREAM identity-propagation section to a deferred-design note (status, constraint, required SET/GET flow-context mechanism, and an implementation note that the flow blob must hold its own independent refs rather than transferring ownership from the pend entry). The full design is deferred until those layers land in netebpfext.
- Remove the Edge cases subsection: the PETHREAD-not-available constraint is inlined into the helper prose; remaining bullets duplicated the race table, anti-loop guards, helper prototype docs, or chain-aggregation paragraphs.
- Reposition the DEFER_TO_PASSIVE sequence diagram as a wrap-up after Race windows, with notes showing the per-invocation 'do I actually need identity for this decision?' check and the no-defer fast path.
- Trim verifier-rationale prose around the bpf_get_identity prototype.
- Add a consolidated DATAGRAM_DATA / STREAM layer-support work item to the netebpfext work breakdown, gathering pend/complete, identity propagation (independent refs), and program-adaptation requirements.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Correct preprocess_map_update_element locking story: callback runs outside the per-bucket lock; rewrite serialization story around single-logical-writer-per-entry + per-value atomicity + single-winner-delete provided by the bucket lock.
- Rewrite CONTINUE-flow inline-reinvocation rationale (no longer a deadlock claim; now framed as re-entrancy concern).
- Switch pend_key struct from explicit reserved+C_ASSERT to `#pragma pack(push,1)` for layout determinism.
- Fix bucket-allocation comments on pend_map and continuation_map `max_entries`: bucket array IS allocated up front; size to expected concurrent pend count, not UINT32_MAX.
- Add bidirectional forward-compat safety paragraph.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
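To show what `#pragma pack(push,1)` buys for a map key, here is a small sketch with a made-up key layout. The struct name and its fields (`pend_id`, `layer_id`, `compartment_id`) are hypothetical and are not the proposal's actual `pend_key` definition. With packing set to 1 there are no padding bytes, so two keys built field by field compare equal byte for byte, which matters when a hash map hashes and compares raw key bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout, for illustration only. */
#pragma pack(push, 1)
typedef struct {
    uint64_t pend_id;
    uint16_t layer_id;
    uint32_t compartment_id;
} example_pend_key_t;
#pragma pack(pop)

/* With 1-byte packing the layout is fully determined: 8 + 2 + 4 bytes, no padding. */
static_assert(sizeof(example_pend_key_t) == 14, "key must have no padding");
static_assert(offsetof(example_pend_key_t, compartment_id) == 10, "unexpected field offset");

/* Byte-wise comparison, as a hash map keyed on raw bytes would do. */
static int example_pend_key_equal(const example_pend_key_t* a, const example_pend_key_t* b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

Without the pragma, a typical ABI would insert padding after `layer_id`, and uninitialized padding bytes could make logically equal keys compare unequal unless every key is zeroed first.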
| Network callout drivers often need to defer a verdict on a connection or | ||
| packet while waiting for an asynchronous decision from another component | ||
| -- for example, a user-mode policy service or a kernel-mode classification | ||
| driver. The Windows Filtering Platform (WFP) provides several async | ||
| mechanisms at different layers (`FwpsPendOperation` / | ||
| `FwpsCompleteOperation` at ALE authorize layers, `FwpsPendClassify` / | ||
| `FwpsCompleteClassify` at resource assignment, ABSORB+reinject at | ||
| datagram, DEFER/OOB at stream), but eBPF programs running through |
| Network callout drivers often need to defer a verdict on a connection or | |
| packet while waiting for an asynchronous decision from another component | |
| -- for example, a user-mode policy service or a kernel-mode classification | |
| driver. The Windows Filtering Platform (WFP) provides several async | |
| mechanisms at different layers (`FwpsPendOperation` / | |
| `FwpsCompleteOperation` at ALE authorize layers, `FwpsPendClassify` / | |
| `FwpsCompleteClassify` at resource assignment, ABSORB+reinject at | |
| datagram, DEFER/OOB at stream), but eBPF programs running through | |
| Network security applications often need to defer a verdict on a connection or | |
| packet while waiting for an asynchronous decision from another component | |
| -- for example, a policy service. But eBPF programs running through |
| 1. An eBPF program attached to a supported WFP hook point (see | ||
| [Supported WFP layers](#supported-wfp-layers)) must be able to | ||
| pend the current network operation and return control to WFP. |
| 1. An eBPF program attached to a supported WFP hook point (see | |
| [Supported WFP layers](#supported-wfp-layers)) must be able to | |
| pend the current network operation and return control to WFP. | |
| 1. An eBPF program attached to SOCK_ADDR attach points must be able to | |
| pend the current network operation and return a new verdict type. |
We must avoid mentioning WFP in the context of the BPF program. The "API" is the eBPF hook, not WFP.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Design proposal for asynchronous processing of verdicts in `netebpfext`. Adds `docs/AsyncProcessing.md` covering how WFP `classifyFn` callouts can PEND, return to WFP, and later COMPLETE from another context, including how this composes with eBPF program invocation, custom maps, and the existing ebpfcore extension contract.
Closes #5188.
This is a documentation-only PR -- no code or test changes.
Highlights
- `bpf_pend_operation()` / `bpf_complete_operation()`, with a per-bucket lock and an explicit `PENDING` -> `COMPLETING` -> `COMPLETED` lifecycle state machine.
- `ALE_AUTH_CONNECT_V{4,6}` (synchronous reauth) and `ALE_AUTH_RECV_ACCEPT_V{4,6}` (direct `FwpsCompleteOperation` from ebpfext context).
- `AUTH_*` deferred completion, and a best-effort PASSIVE-level fence around in-classifyFn completion (Race B).
Testing
n/a -- documentation only.
Documentation
This PR is the documentation.
Installation
n/a.