Architecture Decision Records

1. SSE (Server-Sent Events) for Triage Streaming

Decision: Use SSE instead of WebSockets for real-time triage updates.

Rationale:

  • Unidirectional communication (server → client) is sufficient
  • Built-in auto-reconnection with EventSource
  • Works seamlessly with HTTP/2 multiplexing
  • Simpler than WebSocket (plain HTTP response, no protocol upgrade or custom framing)
  • Better for read-heavy streaming scenarios

Trade-offs: No bidirectional communication, but not needed for our use case.
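
As an illustration of the wire format involved, here is a minimal sketch of serializing one update into a `text/event-stream` frame. The `TriageEvent` shape and the `triage_update` event name are assumptions made for this example, not taken from the project.

```typescript
// Hypothetical event shape; the field names are illustrative, not from the project.
interface TriageEvent {
  id: string; // sent back by the browser as Last-Event-ID when it reconnects
  event: string; // event name the client subscribes to
  data: Record<string, unknown>;
}

// Serialize one event into the text/event-stream wire format:
// one "field: value" pair per line, terminated by a blank line.
function formatSseEvent(e: TriageEvent): string {
  const payload = JSON.stringify(e.data); // JSON.stringify emits no raw newlines
  return `id: ${e.id}\nevent: ${e.event}\ndata: ${payload}\n\n`;
}

const exampleFrame = formatSseEvent({
  id: "42",
  event: "triage_update",
  data: { step: "risk_check", status: "done" },
});
```

The server would write frames like `exampleFrame` to a response sent with `Content-Type: text/event-stream`. Setting `id` on each frame is what lets `EventSource` resume from the last seen event after an automatic reconnect.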


2. Keyset Pagination over Offset-based

Decision: Implement cursor-based (keyset) pagination using a (customerId, timestamp) composite key.

Rationale:

  • Stable results even when data changes (no page drift)
  • Query cost independent of page depth (an index seek to the cursor, versus OFFSET scanning and discarding every skipped row)
  • Better for large datasets (1M+ rows)
  • Prevents duplicate/missing items on concurrent updates

Trade-offs: Cannot jump to arbitrary page numbers, but forward/backward navigation works well.
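
The cursor logic can be sketched in memory as follows. This is an illustration under assumptions: the `Txn` and `Cursor` types are hypothetical, timestamps are treated as epoch milliseconds, and the (customerId, timestamp) pair is assumed to be unique (otherwise a tie-breaker column would be needed).

```typescript
// Hypothetical row and cursor types for illustration.
interface Cursor { customerId: string; timestamp: number; }
interface Txn extends Cursor { amount: number; }

// Lexicographic order on (customerId, timestamp), ascending.
function compareKey(a: Cursor, b: Cursor): number {
  if (a.customerId !== b.customerId) return a.customerId < b.customerId ? -1 : 1;
  return a.timestamp - b.timestamp;
}

// Return up to `limit` rows strictly after the cursor, plus the next cursor.
// A rough SQL equivalent (PostgreSQL row comparison):
//   WHERE (customer_id, timestamp) > ($1, $2)
//   ORDER BY customer_id, timestamp LIMIT $3
function nextPage(
  rows: Txn[],
  after: Cursor | null,
  limit: number,
): { items: Txn[]; next: Cursor | null } {
  const items = rows
    .slice()
    .sort(compareKey)
    .filter((r) => after === null || compareKey(r, after) > 0)
    .slice(0, limit);
  const last = items[items.length - 1];
  // A full page may be followed by one empty page; that is acceptable here.
  const next =
    last !== undefined && items.length === limit
      ? { customerId: last.customerId, timestamp: last.timestamp }
      : null;
  return { items, next };
}
```

Because each page is defined by a key comparison rather than a row count, a row inserted before the cursor does not shift the contents of later pages.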


3. Circuit Breaker Pattern for External Tools

Decision: Implement a per-agent circuit breaker that opens after 3 consecutive failures and stays open for 30s before allowing a retry.

Rationale:

  • Prevents cascading failures when risk/fraud APIs are down
  • Allows system to degrade gracefully with fallbacks
  • Auto-recovery after timeout period
  • Protects downstream services from overload

Trade-offs: Brief service degradation during recovery window.
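
A minimal sketch of such a breaker, using the thresholds from the decision above. The class and method names are hypothetical, the call is shown as synchronous to keep the example short (real tool calls would be async), and the clock is injectable so the behavior can be tested.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: opens after `threshold` consecutive failures,
// stays open for `openMs`, then lets one trial call through.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly openMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, openMs = 30000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.openMs = openMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  call<T>(fn: () => T, fallback: () => T): T {
    if (this.state === "open") {
      // Short-circuit to the fallback until the open window has elapsed.
      if (this.now() - this.openedAt < this.openMs) return fallback();
      this.state = "half-open";
    }
    try {
      const result = fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call, or reaching the threshold, (re)opens the breaker.
      if (this.state === "half-open" || this.failures >= this.threshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

One breaker instance per agent keeps a failing risk or fraud API from affecting calls to the others.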


4. Deterministic Fallbacks for All Agents

Decision: Every agent has rule-based fallback logic (no LLM dependency required).

Rationale:

  • System works offline without external API calls
  • Predictable behavior for testing and evaluation
  • Faster response times (no network latency)
  • Compliance-friendly (no data leaves infrastructure)

Trade-offs: Less sophisticated insights compared to LLM-powered analysis.
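
To make the idea concrete, here is a small sketch of a rule-based fallback. Everything in it is hypothetical: the `TxnSummary` and `Verdict` types, the field names, and the thresholds are invented for illustration and do not reflect the project's actual rules.

```typescript
// Hypothetical input and output shapes.
interface TxnSummary { amount: number; velocityLastHour: number; }
interface Verdict { risk: "low" | "medium" | "high"; source: "llm" | "rules"; }

// Deterministic rules: the same input always yields the same verdict.
// The thresholds are placeholders, not real policy.
function ruleBasedRisk(t: TxnSummary): Verdict {
  if (t.amount > 5000 || t.velocityLastHour > 10) return { risk: "high", source: "rules" };
  if (t.amount > 1000 || t.velocityLastHour > 5) return { risk: "medium", source: "rules" };
  return { risk: "low", source: "rules" };
}

// The caller passes null when the LLM is unavailable or disabled.
function resolveVerdict(llmVerdict: Verdict | null, t: TxnSummary): Verdict {
  return llmVerdict ?? ruleBasedRisk(t);
}
```

Tagging each verdict with its `source` makes it visible in tests and evaluations which path produced a given result.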


5. Virtual Scrolling for Large Tables

Decision: Use TanStack Virtual for alert/transaction tables (2k+ rows).

Rationale:

  • Renders only visible rows (~20-30 DOM nodes vs 2000+)
  • Eliminates scroll jank and memory bloat
  • Maintains 60fps scrolling performance
  • Works with dynamic row heights

Trade-offs: Adds some implementation complexity, but rendering cost no longer grows with the number of rows.
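
The core arithmetic behind windowed rendering can be sketched as below. This is not TanStack Virtual's API; it is a simplified illustration that assumes a fixed row height, whereas the library also measures rows to support dynamic heights. The function name and the `overscan` default are assumptions.

```typescript
// Compute the inclusive range of row indexes that should be in the DOM.
// `overscan` renders a few extra rows above and below the viewport
// so fast scrolling does not reveal blank space.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  rowCount: number,
  overscan = 5,
): { start: number; end: number } {
  const first = Math.floor(scrollTop / rowHeight);
  const visible = Math.ceil(viewportHeight / rowHeight);
  const start = Math.max(0, first - overscan);
  const end = Math.min(rowCount - 1, first + visible + overscan);
  return { start, end };
}

// 2000 rows of 30px in a 600px viewport, scrolled to the top.
const topRange = visibleRange(0, 600, 30, 2000); // { start: 0, end: 25 }
```

Only rows from `start` to `end` are rendered, each positioned at `index * rowHeight` inside a container whose total height is `rowCount * rowHeight`.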


6. Prisma ORM over Raw SQL

Decision: Use Prisma for type-safe database access with TypeScript.

Rationale:

  • Compile-time type safety (wrong field names and mismatched types fail at build time instead of at runtime)
  • Auto-generated types from schema
  • Migration management built-in
  • Developer productivity (autocomplete, refactoring)

Trade-offs: Slight performance overhead vs raw SQL, but negligible for our scale.


7. Redis for Rate Limiting and Caching

Decision: Implement token bucket rate limiter in Redis (5 req/sec per client).

Rationale:

  • Distributed state across API instances
  • Atomic commands (INCR, EXPIRE) prevent race conditions; a multi-step check should run in a Lua script or MULTI/EXEC so it stays atomic
  • Sub-millisecond latency for checks
  • TTL-based cleanup (no manual garbage collection)

Trade-offs: Additional infrastructure dependency, but essential for multi-instance deployments.
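
The token bucket algorithm named in the decision can be sketched in memory as follows. This is a stand-in for illustration, not the Redis implementation: in a multi-instance deployment the bucket state would live in Redis and the refill-and-consume step would need to run atomically. The class name and the injectable clock are assumptions.

```typescript
interface Bucket { tokens: number; lastRefillMs: number; }

// Hypothetical in-memory limiter: each client has a bucket that holds up to
// `capacity` tokens and refills continuously at `refillPerSec` tokens per second.
class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  constructor(capacity = 5, refillPerSec = 5, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
  }

  // Returns true if the request may proceed, consuming one token.
  allow(clientId: string): boolean {
    const t = this.now();
    const b = this.buckets.get(clientId) ?? { tokens: this.capacity, lastRefillMs: t };
    const elapsedSec = (t - b.lastRefillMs) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
    b.lastRefillMs = t;
    const allowed = b.tokens >= 1;
    if (allowed) b.tokens -= 1;
    this.buckets.set(clientId, b);
    return allowed;
  }
}
```

With the defaults, a client can burst up to 5 requests and then sustain 5 requests per second. A plain INCR plus EXPIRE counter would instead give a fixed-window limit, which is simpler but allows bursts at window boundaries.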


8. Idempotency Keys for Mutations

Decision: Require Idempotency-Key header for all state-changing operations.

Rationale:

  • Prevents duplicate actions on network retries
  • Safe to retry failed requests
  • Audit trail links multiple attempts to same logical action
  • Industry best practice (Stripe, Twilio, etc.)

Trade-offs: Clients must generate unique keys, but prevents costly mistakes.
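
A minimal sketch of the server-side behavior, assuming an in-memory store. The `IdempotencyStore` class and `StoredResponse` shape are hypothetical; a real deployment would likely persist keys with an expiry and guard against two concurrent requests carrying the same key.

```typescript
// Hypothetical cached response shape.
interface StoredResponse { status: number; body: string; }

class IdempotencyStore {
  private readonly seen = new Map<string, StoredResponse>();

  // Run `action` at most once per key; later calls replay the stored response.
  execute(
    key: string,
    action: () => StoredResponse,
  ): { response: StoredResponse; replayed: boolean } {
    const prior = this.seen.get(key);
    if (prior !== undefined) return { response: prior, replayed: true };
    const response = action();
    this.seen.set(key, response);
    return { response, replayed: false };
  }
}
```

A client that retries after a timeout sends the same `Idempotency-Key` value, so the retry receives the original response instead of performing the action twice.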


9. PII Redaction Pipeline

Decision: Redact PAN (13-19 digit sequences) and mask emails in all logs/traces/UI.

Rationale:

  • PCI-DSS compliance requirement
  • Defense-in-depth (multiple layers of redaction)
  • Prevents accidental exposure in logs/monitoring
  • Required for audit trail security

Trade-offs: Cannot reconstruct original data from logs (intentional).
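
A sketch of what one redaction pass might look like. The patterns and the masking style are assumptions for this example: the PAN pattern only catches unbroken runs of 13 to 19 digits (not numbers split by spaces or dashes), and the email pattern is a simplification rather than a full address grammar.

```typescript
// Unbroken runs of 13-19 digits, as described in the decision.
const PAN_PATTERN = /\b\d{13,19}\b/g;
// Simplified email pattern: captures the first character and the domain.
const EMAIL_PATTERN = /\b([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b/g;

// Replace PANs entirely and keep only the first character of an email's local part.
function redact(text: string): string {
  return text
    .replace(PAN_PATTERN, "****REDACTED****")
    .replace(EMAIL_PATTERN, "$1***@$2");
}

const redactedLine = redact("card 4111111111111111 from jane.doe@example.com");
// redactedLine === "card ****REDACTED**** from j***@example.com"
```

Applying the same function at each output boundary (logger, tracer, API serializer) is one way to get the layered redaction described above.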


10. Prometheus Metrics over Custom Solution

Decision: Export metrics in Prometheus format via /metrics endpoint.

Rationale:

  • Industry standard for observability
  • Rich ecosystem (Grafana, Alertmanager, etc.)
  • Pull-based model (services only expose an endpoint; scrape targets are configured in Prometheus)
  • Built-in aggregation (PromQL) and alerting rules

Trade-offs: Requires running a Prometheus server to scrape and store metrics (and a tool such as Grafana for dashboards), but the stack is widely adopted.
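
For reference, the text exposition format served from /metrics looks roughly like what this sketch produces. The `renderCounter` helper and the metric name are hypothetical, and label-value escaping (backslashes, quotes, newlines) is omitted for brevity; a client library would normally handle all of this.

```typescript
interface CounterSample { labels: Record<string, string>; value: number; }

// Render one counter family: a HELP line, a TYPE line, then one line per label set.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labelText = Object.keys(s.labels)
      .sort()
      .map((k) => `${k}="${s.labels[k]}"`)
      .join(",");
    lines.push(labelText ? `${name}{${labelText}} ${s.value}` : `${name} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

const metricsText = renderCounter("http_requests_total", "Total HTTP requests", [
  { labels: { method: "GET", status: "200" }, value: 3 },
]);
```

Prometheus scrapes this text on its own schedule, so the service does not need to know where the metrics server lives.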


11. Docker Compose for Local Development

Decision: Single docker-compose.yml brings up all services (Postgres, Redis, API, Web).

Rationale:

  • One command to start entire stack
  • Consistent environment across developers
  • Easy cleanup and reset
  • Production-like local setup

Trade-offs: Higher resource usage than native processes, but worth it for the consistency.


12. Monorepo Structure

Decision: Keep client and server in same repository with shared types.

Rationale:

  • Atomic commits across frontend/backend
  • Shared TypeScript types (API contracts)
  • Simplified CI/CD (single build pipeline)
  • Easier code reviews (see both sides of changes)

Trade-offs: Larger repository, but better developer experience.