[doc][netebpfext] Add proposal for async operations in netebpfext#5189

Draft
matthewige wants to merge 23 commits into microsoft:main from matthewige:user/maige/pend_proposal

@matthewige matthewige commented Apr 22, 2026

Description

Design proposal for asynchronous processing of verdicts in netebpfext. Adds docs/AsyncProcessing.md covering how WFP classifyFn callouts can PEND, return to WFP, and later COMPLETE from another context, including how this composes with eBPF program invocation, custom maps, and the existing ebpfcore extension contract.

Closes #5188.

This is a documentation-only PR -- no code or test changes.

Highlights

  • Pend / complete model built on a custom continuation map plus the extension helpers bpf_pend_operation() / bpf_complete_operation(), with a per-bucket lock and an explicit PENDING -> COMPLETING -> COMPLETED lifecycle state machine.
  • Layer-specific integration for ALE_AUTH_CONNECT_V{4,6} (synchronous reauth) and ALE_AUTH_RECV_ACCEPT_V{4,6} (direct FwpsCompleteOperation from ebpfext context).
  • Threaded-DPC unwinding contract for AUTH_* deferred completion, and a best-effort PASSIVE-level fence around in-classifyFn completion (Race B).
  • ebpfcore platform requirements consolidated, including kernel-mode CRUD APIs for custom maps and a provider-side map handle issued at registration so extensions can drive periodic stale-entry cleanup through the same CRUD path.
  • netebpfext work breakdown for the PEND helpers, the threaded-DPC unwinding glue, and the PASSIVE fence.
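
For readers unfamiliar with the lifecycle named in the first highlight, here is a minimal C sketch of a PENDING -> COMPLETING -> COMPLETED transition in which exactly one completer wins. It is an illustration only: the type and function names (`pend_entry_t`, `pend_entry_try_begin_complete`, and so on) are hypothetical, and it uses a C11 atomic compare-exchange in place of the per-bucket lock described in the proposal.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lifecycle states; only the state names come from the PR text. */
enum { ENTRY_PENDING = 0, ENTRY_COMPLETING = 1, ENTRY_COMPLETED = 2 };

/* Hypothetical pend entry: lifecycle state plus the verdict set at completion. */
typedef struct {
    _Atomic int state;
    int verdict;
} pend_entry_t;

static void pend_entry_init(pend_entry_t* entry)
{
    atomic_init(&entry->state, ENTRY_PENDING);
    entry->verdict = 0;
}

/* Returns true for exactly one caller: the one that moves PENDING -> COMPLETING.
 * Any later (or concurrent losing) caller sees a non-PENDING state and gets false. */
static bool pend_entry_try_begin_complete(pend_entry_t* entry)
{
    int expected = ENTRY_PENDING;
    return atomic_compare_exchange_strong(&entry->state, &expected, ENTRY_COMPLETING);
}

/* Called only by the winner: record the verdict, then publish COMPLETED. */
static void pend_entry_finish_complete(pend_entry_t* entry, int verdict)
{
    entry->verdict = verdict;
    atomic_store(&entry->state, ENTRY_COMPLETED);
}
```

The intermediate COMPLETING state is what lets a second COMPLETE attempt (or a cleanup path) detect that another context already owns the entry, rather than racing it.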

Testing

n/a -- documentation only.

Documentation

This PR is the documentation.

Installation

n/a.


Copilot AI left a comment

Pull request overview

Adds a design proposal document for introducing pend/complete-style asynchronous verdict processing to netebpfext (eBPF network extension), intended to enable external orchestrators (user-mode and/or kernel-mode) to complete deferred WFP decisions.

Changes:

  • Introduces a new proposal doc describing an async “PEND/COMPLETE/CONTINUE” lifecycle built around a custom map and a net_ebpf_ext_pend_operation() helper.
  • Documents expected control flow for PEND, COMPLETE, CONTINUE, and common failure/edge cases across multiple WFP layers.
  • Outlines required ebpfcore platform changes and a phased netebpfext work breakdown.

Comment thread docs/AsyncProcessing.md Outdated
[Supported WFP layers](#supported-wfp-layers)) must be able to
pend the current network operation and return control to WFP.
2. An external async orchestrator must be able to complete the pended operation
with a verdict (PERMIT, BLOCK, or CONTINUE) at a later time.
Collaborator

Any constraints on "later"? How do you prevent a buggy ebpf program from pending and never completing, thus consuming all resources? Is there something the verifier can do to ensure safety?

Contributor Author

Good question. To be clear, there is no 100% guarantee here -- bounding completion latency is the async orchestrator's responsibility, by design. The extension provides backstops, not a guarantee:

  1. The pend map is bounded by max_entries. Once full, pend() returns an error and the program must return a non-PEND verdict, so a buggy/stalled orchestrator degrades gracefully (no kernel resource exhaustion).
  2. A per-entry stale-entry watchdog (driven by the timestamp recorded at pend time) reclaims entries whose orchestrator never came back.
  3. There is no verifier check that proves COMPLETE will eventually be called, since that depends entirely on user-mode behavior; the verifier can't reason about it.

Added a forward-link from the requirement to Edge case 1 in 4228b62 so the design overview points readers at these backstops. Happy to add a more explicit verifier-side check if you have one in mind (e.g., 'program must call pend() before returning PEND').
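
To make backstops 1 and 2 concrete, here is a minimal, single-threaded C sketch of a bounded pend table with a timestamp-driven stale-entry sweep. Everything in it is an assumption for illustration: the names (`pend_table_t`, `pend_table_insert`, `pend_table_reap_stale`), the fixed-size array standing in for a map bounded by `max_entries`, and the caller-supplied timeout. It does not reflect the actual netebpfext data structures or locking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PEND_MAX_ENTRIES 4 /* Stands in for the map's max_entries (value is arbitrary). */

/* Hypothetical pend entry: an identifier plus the timestamp recorded at pend time. */
typedef struct {
    bool in_use;
    uint64_t pend_id;
    uint64_t pend_time_ms;
} pend_slot_t;

typedef struct {
    pend_slot_t slots[PEND_MAX_ENTRIES];
} pend_table_t;

/* Backstop 1: insertion fails once the table is full, so the caller must
 * fall back to a non-PEND verdict. Returns 0 on success, -1 when full. */
static int pend_table_insert(pend_table_t* table, uint64_t pend_id, uint64_t now_ms)
{
    for (size_t i = 0; i < PEND_MAX_ENTRIES; i++) {
        if (!table->slots[i].in_use) {
            table->slots[i].in_use = true;
            table->slots[i].pend_id = pend_id;
            table->slots[i].pend_time_ms = now_ms;
            return 0;
        }
    }
    return -1;
}

/* Backstop 2: reclaim entries whose age has reached timeout_ms.
 * Assumes now_ms comes from a monotonic clock (never earlier than pend_time_ms).
 * Returns the number of entries reclaimed. */
static size_t pend_table_reap_stale(pend_table_t* table, uint64_t now_ms, uint64_t timeout_ms)
{
    size_t reaped = 0;
    for (size_t i = 0; i < PEND_MAX_ENTRIES; i++) {
        pend_slot_t* slot = &table->slots[i];
        if (slot->in_use && now_ms - slot->pend_time_ms >= timeout_ms) {
            slot->in_use = false; /* A real sweep would also fail the pended operation here. */
            reaped++;
        }
    }
    return reaped;
}
```

In this sketch a stalled orchestrator can hold at most `PEND_MAX_ENTRIES` operations, and each of those for at most the timeout plus one sweep interval.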

Collaborator

@dthaler dthaler May 14, 2026

If there is no guarantee, I'm afraid this will degrade the value of eBPF itself. Would like to discuss this in a meeting. I didn't quite understand the point "A per-entry stale-entry watchdog (driven by the timestamp recorded at pend time) reclaims entries whose orchestrator never came back". If this is just saying that it's guaranteed to complete within (say) 2 seconds, by failing (and cleaning up state) if not completed, then that would be a type of safety guarantee.

matthewige and others added 5 commits April 28, 2026 12:12
- Fix TOC anchor for AUTH_CONNECT/AUTH_RECV_ACCEPT (slash maps to double dash)

- Remove FwpsPendClassify reference in DATAGRAM_DATA section -- per the layer table, datagram uses ABSORB+reinject only, no WFP pend API

- Fix typos (is is, the the)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add forward link from the COMPLETE requirement to Edge case 1, and call out that bounding completion latency is the orchestrator's responsibility -- the extension only provides backstops (stale-entry watchdog, bounded max_entries).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note that re-invocation is asynchronous on a worker thread at PASSIVE_LEVEL, not synchronous inside COMPLETE, and not pinned to the original classifyFn CPU. Adds a forward-link to the CONTINUE flow section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- PERMIT -> PERMIT_SOFT/PERMIT_HARD/BLOCK across COMPLETE step body,
  callback validation list, and sequence-diagram labels.
- Failure flow: REJECT only (remove contradictory PERMIT fallback).
- Trim redundancy in helper description, 'Why threaded DPC?' blockquote,
  and CPU hot-unplug fallback (cross-ref canonical sources).
- Add diagnostic trace note in 'Multiple attached programs and PEND'.
- Service-crash text simplified to brief 'no active cleanup' wording.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address PR review feedback: the raw WSACMSGHDR* pointer would not be
valid after the original classifyFn returns. Add a comment specifying
that control_data is deep-copied into extension-owned memory at pend
time so it is safe to access during completion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Alan-Jowett
Member

Minor issue:
Not all of the mermaid diagrams render correctly in the GitHub UI.

matthewige and others added 2 commits May 1, 2026 07:42
- Add extension-side stale-entry watchdog (layer (e)) as kernel-owned
  backstop; cite netebpfext connect-redirect precedent.
- Tighten saved-state-for-CONTINUE: extension saves full helper-visible
  net_ebpf_sock_addr_t wrapper (hook_id, redirect_context,
  transport_endpoint_handle, process_id, access_information,
  original_context).
- Add keying-note callout for continuation map (CONTINUE is
  consumer-defined; conn-tuple holds for ALE; per-packet layer keying
  deferred to per-layer design).
- Soften 'cleanup orchestrator responsibility' to layered model
  (orchestrator primary; extension backstop).
- Threaded-DPC: remove duplicated 'Load-bearing assumption' callout;
  reframe as 'Implementation validation' (engineering validation of
  reimplementation, not uncertainty about underlying DDK pattern).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the best-effort KEVENT signal-at-end fence for layers where
WFP delivers classifyFn at PASSIVE_LEVEL (TCP ALE_AUTH_CONNECT/
ALE_AUTH_RECV_ACCEPT). The threaded-DPC ordering proof relies on
the IRQL barrier and does not transfer to PASSIVE; the fence narrows
but does not close the race window with WFP's post-classify pipeline
cleanup. Flagged as an open dependency for WFP-team confirmation.

- New sub-section under Race B
- Per-layer callout under AUTH_CONNECT / AUTH_RECV_ACCEPT
- New netebpfext work-breakdown item for the fence implementation
- Fix broken Race B anchor (missing third hyphen in slug for ' -> ')

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
matthewige and others added 12 commits May 1, 2026 10:14
Extension-driven periodic stale-entry purge (Edge case 1 back-stop)
requires a kernel-mode handle to the custom map plus an enumeration
entry point. Today the dispatch-table callbacks deliver per-key
context only -- the provider has no map handle and no way to iterate
independently of an inbound op. Added as ebpfcore platform
requirement #5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ebpfcore platform requirements item 2 now includes the map-handle
  requirement and the extension-driven periodic stale-entry purge
  (folded in from a brief separate item 5 that was added then
  removed).
- Drop empty '## WFP implementation requirements' section (was a
  one-line filler heading missing from the TOC; per-layer
  requirements live under '## Per-layer async design').
- Helper-name typo: pend_operation() -> bpf_pend_operation() in the
  implicit-program-context-accessor narrative (matches every other
  reference to the helper in this doc).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the WESP-side change: add originating_program to
net_ebpf_ext_pend_internal_state_t so netebpfext can identify which
program in a chain returned PEND when re-dispatching on CONTINUE.

This was a real gap when the pend map is shared across multiple
programs attached at the same attach point. Layer + attach params
were already captured (layer_id, compartment_id) and chain semantics
were already pinned down (PEND short-circuits at program N,
aggregate_verdict from 1..N-1 preserved, only one program may PEND).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the WESP-side correction. The doc had implied that user-mode CRUD
support was something that needed to be added -- in fact, user-mode and
BPF-program-driven CRUD via the existing dispatch-table callbacks
(preprocess_map_update_element, preprocess_map_delete_element,
postprocess_map_find_element, postprocess_map_delete) is fully wired and
works today.

The actual remaining ebpfcore gap is extension-initiated kernel-mode
CRUD APIs that let a custom-map provider drive operations on its own
map from kernel mode (from inside an extension helper handler such as
pend(), and from the threaded DPC where there is no in-flight user-mode
or BPF caller to drive the dispatch).

Updated the design-overview note and ebpfcore work-breakdown item 2 to
reflect this -- the user-mode/BPF-helper paths are explicitly called
out as 'no new wiring needed', and only the kernel-mode CRUD path is
flagged as new ebpfcore work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror updates from the WESP async-processing PR round 4 walkthrough:

- Synthesized COMPLETE: clarified how netebpfext identifies which entry to clean up via pend_id stored in private classifyFn context wrapper.
- Multiple attached programs: documented detection of competing PEND-capable programs via a new MAY_PEND attach-params flag plus client-side enumeration of bpf_link_info.attach_data.
- Smaller threads: padding/static_assert, netebpfext fail behavior.

netebpfext work breakdown updated to track new attach-params extension item.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…odern standby + generic)

New Edge case 4 documents two mitigation patterns for conditions that require
the orchestrator to temporarily disallow new pends and force-drain already-pended
operations (modern standby is the canonical example; in-place servicing and
planned maintenance are others):

- Pattern (a): pend control custom map (BPF_MAP_TYPE_NET_EBPF_EXT_PEND_CONTROL,
  size 1, singleton per extension instance, opt-in). Map is zero-initialized to
  PEND_STATE_DISABLED so the default is fail-closed; orchestrator affirms
  readiness with an explicit PEND_STATE_ENABLED write at startup. Helper
  short-circuits to NET_EBPF_EXT_PEND_ERROR_DISABLED while disabled; the
  ENABLED -> DISABLED transition queues a kernel-side threaded DPC drain.
- Pattern (b): best-effort shared BPF_MAP_TYPE_ARRAY polling fallback.
  Race window for new pends; no kernel-side fail-safe drain; cheaper.

Neutral trade-off framing: (a) actually closes the 9F surface, (b) significantly
lower cost but leaves the problem unresolved; (a) can be layered on top of (b)
incrementally.
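
A minimal C sketch of the Pattern (a) gate described in this commit message follows. It is a hedged illustration, not the proposed implementation: `pend_control_t`, `pend_gate_check`, `pend_control_write`, and the numeric value of `PEND_ERROR_DISABLED` are hypothetical, and a simple counter stands in for queuing the kernel-side threaded DPC drain.

```c
#include <stdatomic.h>

/* State names are from the commit message; zero must mean DISABLED so a
 * zero-initialized control map is fail-closed. */
typedef enum { PEND_STATE_DISABLED = 0, PEND_STATE_ENABLED = 1 } pend_control_state_t;

#define PEND_OK 0
#define PEND_ERROR_DISABLED (-1) /* Stands in for NET_EBPF_EXT_PEND_ERROR_DISABLED; value is hypothetical. */

/* Hypothetical singleton control block. */
typedef struct {
    _Atomic int state;  /* Zero-initialized, i.e. PEND_STATE_DISABLED. */
    int drain_requests; /* Counts drains requested; stands in for queuing a threaded DPC. */
} pend_control_t;

/* Helper-side check: short-circuit with an error while pending is disabled. */
static int pend_gate_check(pend_control_t* control)
{
    return atomic_load(&control->state) == PEND_STATE_ENABLED ? PEND_OK : PEND_ERROR_DISABLED;
}

/* Orchestrator-side write: only the ENABLED -> DISABLED edge requests a drain,
 * so repeated DISABLED writes do not queue redundant drains. */
static void pend_control_write(pend_control_t* control, pend_control_state_t new_state)
{
    int old_state = atomic_exchange(&control->state, (int)new_state);
    if (old_state == PEND_STATE_ENABLED && new_state == PEND_STATE_DISABLED) {
        control->drain_requests++;
    }
}
```

Using an atomic exchange to read the previous state and write the new one in a single step is one way to detect the ENABLED -> DISABLED edge without a separate lock; the proposal itself does not specify this mechanism.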

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…CEPT to DISPATCH; narrow PASSIVE fence to AUTH_LISTEN

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…URITY_SUBJECT_CONTEXT)

Documents the design for identity-aware programs that need a full
PACCESS_TOKEN or captured SECURITY_SUBJECT_CONTEXT (rather than just
the DISPATCH-safe TOKEN_ACCESS_INFORMATION blob from
FWPS_INCOMING_VALUE_ALE_USER_ID).

Key points:
- DISPATCH-vs-PASSIVE applicability per WFP layer (token resolution
  via ObReferenceObjectByHandle / PsLookupProcessByProcessId is
  PASSIVE-only; AUTH_CONNECT and AUTH_RECV_ACCEPT are DISPATCH).
- New extension-specific helpers bpf_get_access_token /
  bpf_get_subject_context that return NULL at DISPATCH when not
  pre-resolved.
- New BPF_*_VERDICT_DEFER_TO_PASSIVE verdict + extension-driven PEND
  + threaded-DPC pinned to classifyFn CPU + worker re-invocation via
  the existing CONTINUE path.
- PID-reuse detection via token-pointer equality
  (PsReferencePrimaryToken vs ObReferenceObjectByHandle of saved
  HANDLE) before SeCaptureSubjectContextEx is called.
- PETHREAD limitation (Thread = NULL; matches WFP ALE access-check
  semantics).
- Lifecycle / cleanup table; double-defer and PASSIVE-layer-defer
  fail-closed guards.

Tracking issues: microsoft#5231 (parent), microsoft#5235 (bpf_get_access_token),
microsoft#5236 (bpf_get_subject_context).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace two-helper design (bpf_get_access_token + bpf_get_subject_context)
  with single bpf_get_identity() returning bpf_identity_info_t. Errno set
  {0, -EAGAIN, -ENOENT, -EINVAL}.
- Classify-wrapper steps 1-11 cover PEND + DEFER_TO_PASSIVE re-invocation,
  PID-reuse + token-pointer-equality check, atomic-snapshot policy.
- Add No-leak invariants (pend-entry path) section documenting structured
  acquire/release pairing across all failure paths.
- Add Race windows R1-R10 with per-row mitigations.
- Add DATAGRAM_DATA / STREAM identity propagation: per-flow blob with
  independent refs published via FwpsFlowAssociateContext0 + flowDeleteFn;
  F1-F6 race table; open implementation questions.
- Add netebpfext work-breakdown item 10 (identity-aware programs).
- Reference single GH issue microsoft#5235 (supersedes microsoft#5236).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore subject clause in 'No-leak invariants (pend-entry path)'
  intro that lost its leading sentence in a prior edit.
- Drop filler closing sentence in Custom-map subsection.
- Drop 'Race protection: COMPLETE before pend API' callout in PEND
  flow; full argument lives in Race A and pend-API ordering note
  cross-references it.
- Drop CONTINUE re-invocation Note that restated steps 3 and 5 of the
  same flow verbatim.
- Drop section-preview sentence in Identity-aware programs intro
  (Design overview enumerates the same content one section below).
- Drop heading-restating opener sentences in Per-layer async design
  and Async orchestrator integration guide section intros.
- Tighten orchestrator-guide 'COMPLETE and cleanup' subsection to
  cross-references; full stale-entry cleanup design is in Edge case 1.

No technical detail removed; all design rationale, race tables,
ownership/lifetime invariants, and per-layer specifics are preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the corresponding refinements from the internal WESP design doc:

- Add Anti-loop guards callout in 'New verdict: defer to PASSIVE' (paired helper/verdict guards prevent runaway DISPATCH<->PASSIVE loops).

- Correct the MAY_PEND wording: the user-mode loader sets the existing MAY_PEND attach flag at bpf_link_create time for programs that use bpf_get_identity(); it is not auto-stamped from bytecode. Note that netebpfext treats DEFER_TO_PASSIVE as a chain-terminating verdict identical to PEND.

- Compress the DATAGRAM_DATA / STREAM identity-propagation section to a deferred-design note (status, constraint, required SET/GET flow-context mechanism, and an implementation note that the flow blob must hold its own independent refs rather than transferring ownership from the pend entry). The full design is deferred until those layers land in netebpfext.

- Remove the Edge cases subsection: the PETHREAD-not-available constraint is inlined into the helper prose; remaining bullets duplicated the race table, anti-loop guards, helper prototype docs, or chain-aggregation paragraphs.

- Reposition the DEFER_TO_PASSIVE sequence diagram as a wrap-up after Race windows, with notes showing the per-invocation 'do I actually need identity for this decision?' check and the no-defer fast path.

- Trim verifier-rationale prose around the bpf_get_identity prototype.

- Add a consolidated DATAGRAM_DATA / STREAM layer-support work item to the netebpfext work breakdown, gathering pend/complete, identity propagation (independent refs), and program-adaptation requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Correct preprocess_map_update_element locking story: callback runs
  outside the per-bucket lock; rewrite serialization story around
  single-logical-writer-per-entry + per-value atomicity +
  single-winner-delete provided by the bucket lock.
- Rewrite CONTINUE-flow inline-reinvocation rationale (no longer a
  deadlock claim; now framed as re-entrancy concern).
- Switch pend_key struct from explicit reserved+C_ASSERT to
  `#pragma pack(push,1)` for layout determinism.
- Fix bucket-allocation comments on pend_map and continuation_map
  `max_entries`: bucket array IS allocated up front; size to
  expected concurrent pend count, not UINT32_MAX.
- Add bidirectional forward-compat safety paragraph.
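
As an illustration of the `#pragma pack(push,1)` bullet above, the following C sketch shows how packing removes implicit padding from a key struct. The struct name and its fields are hypothetical (the real `pend_key` layout is not shown in this conversation); the point is only that the packed layout has no compiler-inserted padding bytes, which matters if keys are hashed or compared bytewise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout, for illustration only. Without packing, a typical
 * ABI would insert 2 padding bytes after layer_id, giving sizeof == 16. */
#pragma pack(push, 1)
typedef struct {
    uint64_t pend_id;        /* offset 0 */
    uint16_t layer_id;       /* offset 8 */
    uint32_t compartment_id; /* offset 10 when packed (12 when not) */
} example_pend_key_t;
#pragma pack(pop)

/* Compile-time checks that the packed layout is what the comments claim. */
static_assert(sizeof(example_pend_key_t) == 14, "packed key must have no padding");
static_assert(offsetof(example_pend_key_t, compartment_id) == 10, "no gap after layer_id");
```

The trade-off is that packed fields may be unaligned, so this approach suits small key structs that are copied or compared as a whole rather than accessed field-by-field in hot paths.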

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread docs/AsyncProcessing.md Outdated
Comment on lines +32 to +39
Network callout drivers often need to defer a verdict on a connection or
packet while waiting for an asynchronous decision from another component
-- for example, a user-mode policy service or a kernel-mode classification
driver. The Windows Filtering Platform (WFP) provides several async
mechanisms at different layers (`FwpsPendOperation` /
`FwpsCompleteOperation` at ALE authorize layers, `FwpsPendClassify` /
`FwpsCompleteClassify` at resource assignment, ABSORB+reinject at
datagram, DEFER/OOB at stream), but eBPF programs running through
Collaborator

Suggested change
Network security applications often need to defer a verdict on a connection or
packet while waiting for an asynchronous decision from another component
-- for example, a policy service. But eBPF programs running through

Comment thread docs/AsyncProcessing.md Outdated
Comment on lines +60 to +62
1. An eBPF program attached to a supported WFP hook point (see
[Supported WFP layers](#supported-wfp-layers)) must be able to
pend the current network operation and return control to WFP.
Collaborator

Suggested change
1. An eBPF program attached to SOCK_ADDR attach points must be able to
pend the current network operation and return a new verdict type.

We must avoid mentioning WFP in the context of the BPF program. The "API" is eBPF hook and not WFP.

@shankarseal shankarseal moved this from Todo to In Progress in eBPF for Windows Triage May 11, 2026
@matthewige matthewige marked this pull request as draft May 12, 2026 19:12
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>