Bound memory growth in TraceStore and token revocation state #182

Description

@dgenio

Summary

Add caps and expiry-based cleanup to the two unbounded in-memory structures:
TraceStore (plain dict, no cap) and HMACTokenProvider._revoked /
_principal_tokens (an acknowledged TODO), mirroring the eviction design
HandleStore already has.

Why this matters

Long-lived agent processes accumulate one trace per invocation and one revocation
entry per revoked token, and never release either. That is a slow memory leak and a
potential denial-of-service vector in high-volume deployments, and it is inconsistent
with the bounded design used elsewhere (HandleStore has max_entries=10_000 plus
lazy/interval eviction).

Current evidence

  • trace.py: TraceStore stores traces in a plain dict with no cap or eviction.
  • tokens.py:196: a comment above _principal_tokens reads "# TODO: consider TTL-based cleanup to bound growth over long-lived instances"; _revoked is an unbounded set[str].
  • handles.py:17-32: documented eviction design (max_entries=10_000, _EVICT_INTERVAL=128, evict_expired()), proving the repo's preferred pattern.

External context

Not required for this issue.

Proposed implementation

  1. TraceStore: add max_entries (default generous, e.g. 10_000) with FIFO/oldest
    eviction and an evicted_count so audit consumers know truncation occurred;
    document interplay with the persistence plans in #126 (pluggable persistence for
    TraceStore, HandleStore, and token revocation, with SQLite + JSONL backends) and
    #127 (hash-chained, verifiable audit log with retention and export controls).
  2. HMACTokenProvider: revocation entries can be dropped once the underlying token's
    expires_at has passed (an expired token fails verification anyway). Track
    expiry alongside revoked ids and sweep on an interval, holding _revocation_lock.
  3. Reuse the HandleStore lazy + interval eviction pattern for consistency.
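
A minimal sketch of item 1, assuming traces are keyed by a string id. The class name BoundedTraceStore and the put/get methods are hypothetical and do not reflect the actual TraceStore API; only max_entries and evicted_count come from the proposal above.

```python
from collections import OrderedDict
from typing import Any, Optional


class BoundedTraceStore:
    """Hypothetical stand-in for TraceStore with a cap and oldest-first eviction."""

    def __init__(self, max_entries: int = 10_000) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        # OrderedDict preserves insertion order, so the first key is the oldest.
        self._traces: "OrderedDict[str, Any]" = OrderedDict()
        # Exposed so audit consumers can tell that truncation occurred.
        self.evicted_count = 0

    def put(self, trace_id: str, trace: Any) -> None:
        # Overwriting an existing id keeps its position and needs no eviction.
        if trace_id not in self._traces:
            while len(self._traces) >= self._max_entries:
                self._traces.popitem(last=False)  # drop the oldest entry
                self.evicted_count += 1
        self._traces[trace_id] = trace

    def get(self, trace_id: str) -> Optional[Any]:
        return self._traces.get(trace_id)

    def __len__(self) -> int:
        return len(self._traces)
```

Eviction order here depends only on insertion order, which keeps it deterministic. Whether an overwrite should refresh an entry's age is a design choice this sketch leaves as "no".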

AI-agent execution notes

  • Inspect first: trace.py, tokens.py (revocation paths, _revocation_lock), handles.py (pattern to mirror), tests/test_trace.py, tests/test_tokens.py.
  • Safety edge case: never drop a revocation entry before the token expires — that would un-revoke a live token. Test this explicitly.
  • Determinism: eviction order must be deterministic.
  • Audit edge case: evicting traces loses audit data; the cap must be configurable and the eviction observable.
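
The safety edge case can be illustrated with a sketch that tracks expires_at next to each revoked id. RevocationRegistry, its methods, and the sweep interval of 128 are assumptions for illustration, not the real HMACTokenProvider internals; it also assumes _revoked becomes a mapping rather than a set[str].

```python
import threading
import time
from typing import Callable, Dict


class RevocationRegistry:
    """Hypothetical sketch of expiry-aware revocation state."""

    _SWEEP_INTERVAL = 128  # assumed, borrowed from HandleStore's _EVICT_INTERVAL

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so tests can control time
        self._revoked: Dict[str, float] = {}  # token id -> expires_at
        self._revocation_lock = threading.Lock()
        self._ops_since_sweep = 0

    def revoke(self, token_id: str, expires_at: float) -> None:
        with self._revocation_lock:
            self._revoked[token_id] = expires_at
            self._ops_since_sweep += 1
            if self._ops_since_sweep >= self._SWEEP_INTERVAL:
                self._sweep_locked()

    def is_revoked(self, token_id: str) -> bool:
        with self._revocation_lock:
            return token_id in self._revoked

    def sweep(self) -> int:
        """Remove entries for expired tokens; return how many were dropped."""
        with self._revocation_lock:
            return self._sweep_locked()

    def _sweep_locked(self) -> int:
        # Caller must hold _revocation_lock.
        now = self._clock()
        # Drop only entries strictly past expiry. An entry at exactly
        # expires_at is kept, which is safe whichever way the verifier
        # treats that boundary.
        expired = [tid for tid, exp in self._revoked.items() if exp < now]
        for tid in expired:
            del self._revoked[tid]
        self._ops_since_sweep = 0
        return len(expired)
```

After a sweep, is_revoked returns False for a dropped id. That is only safe because verification is assumed to reject the expired token on its own, so the expiry check must stay in the verification path.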

Acceptance criteria

  • TraceStore never exceeds its configured cap; eviction is oldest-first and counted.
  • A revoked, unexpired token still fails verification after any sweep.
  • Revocation entries for expired tokens are eventually removed.
  • Memory growth under a loop of grant/revoke is bounded (regression test).

Test plan

Unit tests with injected clocks for sweep timing and the revoked-but-unexpired edge
case; cap tests for TraceStore. Run make ci.
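
One possible shape for these tests, sketched with the standard-library unittest module (the repo's actual test runner is not stated here). FakeClock and sweep_expired are hypothetical helpers standing in for the injected clock and the real sweep.

```python
import unittest
from typing import Dict


class FakeClock:
    """Manually advanced clock so tests never sleep."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def sweep_expired(revoked: Dict[str, float], now: float) -> int:
    """Hypothetical helper: drop entries whose expires_at is strictly before now."""
    expired = [tid for tid, exp in revoked.items() if exp < now]
    for tid in expired:
        del revoked[tid]
    return len(expired)


class SweepTimingTests(unittest.TestCase):
    def test_revoked_but_unexpired_survives_sweep(self) -> None:
        clock = FakeClock(start=100.0)
        revoked = {"tok": 160.0}
        clock.advance(59.0)  # one second before expiry
        self.assertEqual(sweep_expired(revoked, clock()), 0)
        self.assertIn("tok", revoked)

    def test_expired_entry_is_removed(self) -> None:
        clock = FakeClock(start=100.0)
        revoked = {"tok": 160.0}
        clock.advance(61.0)  # one second after expiry
        self.assertEqual(sweep_expired(revoked, clock()), 1)
        self.assertNotIn("tok", revoked)

    def test_growth_is_bounded_under_grant_revoke_loop(self) -> None:
        clock = FakeClock()
        revoked: Dict[str, float] = {}
        ttl = 10.0
        for i in range(1_000):
            revoked[f"tok-{i}"] = clock() + ttl  # grant, then revoke at once
            clock.advance(1.0)
            sweep_expired(revoked, clock())
        # Only tokens issued within the last ttl seconds can remain.
        self.assertLessEqual(len(revoked), int(ttl) + 1)
```

The regression test asserts on entry count rather than process memory, which keeps it deterministic across platforms.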

Documentation plan

Document retention defaults in docs/security.md and docs/architecture.md;
CHANGELOG Added/Fixed.

Migration and compatibility notes

Defaults are chosen high enough that typical sessions are unaffected; deployments needing
unbounded retention should adopt persistence (#126). Not expected to require migration.

Risks and tradeoffs

Trace eviction trades audit completeness for boundedness — make the cap loud
(warning on first eviction). Revocation sweep correctness depends on clock handling;
reuse the injectable-clock pattern.
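
A "loud" cap could be as small as a counter that warns once. This sketch uses the standard warnings module; EvictionNotifier and its message are hypothetical, and a logger could be substituted if that is the repo's convention.

```python
import warnings


class EvictionNotifier:
    """Hypothetical helper: counts evictions and warns on the first one only."""

    def __init__(self, max_entries: int) -> None:
        self._max_entries = max_entries
        self.evicted_count = 0

    def record(self, count: int = 1) -> None:
        if count <= 0:
            return
        if self.evicted_count == 0:
            # First eviction: tell operators that audit data is being dropped.
            warnings.warn(
                f"trace cap of {self._max_entries} reached; "
                "oldest traces are now being evicted",
                RuntimeWarning,
                stacklevel=2,
            )
        self.evicted_count += count
```

Warning once keeps the signal visible without flooding logs on every later eviction; evicted_count remains available for consumers that need the exact number.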

Suggested labels

security, reliability, performance
