From b213ffb0615fbef4fb9b32176c32a14f946ea155 Mon Sep 17 00:00:00 2001 From: Amber Agent Date: Mon, 8 Dec 2025 19:22:34 +0000 Subject: [PATCH] docs: Add comprehensive architecture diagrams MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add visual architecture documentation with Mermaid diagrams covering: - Core 4-component system architecture with data flows - Agentic session lifecycle state machine and reconciliation - Multi-tenancy architecture with namespace isolation and RBAC - Kubernetes Custom Resource structures and relationships Created docs/architecture/ directory with 5 files (3,155 lines): - index.md: Architecture navigation and quick start guide - core-system-architecture.md: System overview and component interactions - agentic-session-lifecycle.md: Session states and operator patterns - multi-tenancy-architecture.md: Project isolation and security - kubernetes-resources.md: CRD schemas and resource lifecycle Updated existing documentation: - docs/index.md: Added architecture section to main navigation - CLAUDE.md: Added references to architecture diagrams - mkdocs.yml: Integrated architecture pages into site navigation All diagrams use Mermaid format for GitHub/MkDocs compatibility. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 --- CLAUDE.md | 6 + .../architecture/agentic-session-lifecycle.md | 611 ++++++++++ docs/architecture/core-system-architecture.md | 402 +++++++ docs/architecture/index.md | 344 ++++++ docs/architecture/kubernetes-resources.md | 1042 +++++++++++++++++ .../multi-tenancy-architecture.md | 756 ++++++++++++ docs/index.md | 9 + mkdocs.yml | 6 + 8 files changed, 3176 insertions(+) create mode 100644 docs/architecture/agentic-session-lifecycle.md create mode 100644 docs/architecture/core-system-architecture.md create mode 100644 docs/architecture/index.md create mode 100644 docs/architecture/kubernetes-resources.md create mode 100644 docs/architecture/multi-tenancy-architecture.md diff --git a/CLAUDE.md b/CLAUDE.md index 5f8d5fb47..0adeea4d0 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -40,6 +40,12 @@ User Creates Session → Backend Creates CR → Operator Spawns Job → Pod Runs Claude CLI → Results Stored in CR → UI Displays Progress ``` +📐 **Architecture Diagrams:** See [docs/architecture/](docs/architecture/) for comprehensive visual guides including: +- [Core System Architecture](docs/architecture/core-system-architecture.md) - 4-component system with data flows +- [Agentic Session Lifecycle](docs/architecture/agentic-session-lifecycle.md) - State machine and reconciliation +- [Multi-Tenancy Architecture](docs/architecture/multi-tenancy-architecture.md) - Project isolation and RBAC +- [Kubernetes Resources](docs/architecture/kubernetes-resources.md) - CRD structures and relationships + ## Memory System - Loadable Context This repository uses a structured **memory system** to provide targeted, loadable context instead of relying solely on this comprehensive CLAUDE.md file. 
diff --git a/docs/architecture/agentic-session-lifecycle.md b/docs/architecture/agentic-session-lifecycle.md new file mode 100644 index 000000000..974982175 --- /dev/null +++ b/docs/architecture/agentic-session-lifecycle.md @@ -0,0 +1,611 @@ +# Agentic Session Lifecycle + +## Overview + +An **AgenticSession** represents a single AI-powered automation task. This document describes the complete lifecycle from creation to completion, including state transitions, operator reconciliation, and error handling. + +## State Machine + +```mermaid +stateDiagram-v2 + [*] --> Pending: User creates session
(Backend creates CR) + + Pending --> Running: Operator creates Job
Pod starts execution + + Running --> Completed: Job succeeds
Results captured + Running --> Failed: Job fails
Error captured + Running --> Timeout: Timeout exceeded
Job terminated + + Completed --> [*] + Failed --> [*] + Timeout --> [*] + + note right of Pending + Initial state + - CR exists + - No Job created yet + - Operator will reconcile + end note + + note right of Running + Active execution + - Job created + - Pod running + - Results streaming + - Status updates frequent + end note + + note right of Completed + Success terminal state + - Results in CR status + - Job succeeded + - Resources cleaned up + end note + + note right of Failed + Error terminal state + - Error message in CR + - Job failed + - Resources cleaned up + end note + + note right of Timeout + Timeout terminal state + - Job terminated + - Partial results captured + - Resources cleaned up + end note +``` + +## Phase Descriptions + +### Pending + +**Entry Condition:** Backend API creates AgenticSession CR + +**State Characteristics:** +- CR exists with `spec` populated +- No `status` or `status.phase = "Pending"` +- No Job created yet +- No Pod running + +**Next Transition:** Operator detects CR and creates Job → `Running` + +**Typical Duration:** 1-5 seconds + +--- + +### Running + +**Entry Condition:** Operator creates Job successfully + +**State Characteristics:** +- Job exists with OwnerReference to AgenticSession +- Pod scheduled and executing +- `status.phase = "Running"` +- `status.startTime` set +- `status.results` may contain partial results + +**Status Updates:** +- Operator monitors Job status every 5 seconds +- Runner updates CR with progress logs +- WebSocket broadcasts updates to frontend + +**Next Transitions:** +- Job succeeds → `Completed` +- Job fails → `Failed` +- Timeout exceeded → `Timeout` + +**Typical Duration:** 30 seconds to 2 hours (configurable) + +--- + +### Completed + +**Entry Condition:** Job completes successfully (exit code 0) + +**State Characteristics:** +- `status.phase = "Completed"` +- `status.completionTime` set +- `status.results` contains final output +- Per-repo `pushed` or `abandoned` status +- Job and Pod cleaned 
up (OwnerReference cascade) + +**Terminal State:** No further transitions + +**Typical Retention:** CR persists for audit/history (manual deletion or TTL) + +--- + +### Failed + +**Entry Condition:** Job fails (non-zero exit code) + +**State Characteristics:** +- `status.phase = "Failed"` +- `status.completionTime` set +- `status.message` contains error details +- `status.results` may contain partial output +- Job and Pod cleaned up + +**Common Failure Reasons:** +- Invalid Anthropic API key +- Git authentication failure +- Runner execution error +- Resource limits exceeded + +**Terminal State:** No further transitions + +**Typical Retention:** CR persists for debugging (manual deletion) + +--- + +### Timeout + +**Entry Condition:** Execution exceeds configured timeout + +**State Characteristics:** +- `status.phase = "Timeout"` +- `status.completionTime` set +- `status.message` indicates timeout +- `status.results` contains partial output +- Job terminated by operator +- Pod cleaned up + +**Timeout Configuration:** +- Default: 1 hour +- Configurable via `spec.timeout` (seconds) +- ProjectSettings can set default per project + +**Terminal State:** No further transitions + +--- + +## Operator Reconciliation Flow + +```mermaid +flowchart TD + Start([Watch Event:
AgenticSession Added/Modified]) + + Start --> GetCR[Get current CR from API] + GetCR --> Exists{CR exists?} + + Exists -->|No - IsNotFound| LogDelete[Log: Resource deleted] + LogDelete --> End([Return - Not an error]) + + Exists -->|Yes| GetPhase[Extract status.phase] + GetPhase --> CheckPhase{phase?} + + CheckPhase -->|Pending| CheckJob{Job exists?} + CheckPhase -->|Running| MonitorJob[Continue monitoring
goroutine exists] + CheckPhase -->|Completed/Failed/Timeout| End + + CheckJob -->|Yes| LogExists[Log: Job already exists] + LogExists --> End + + CheckJob -->|No| CreateJob[Create Job with:
- OwnerReference
- Runner image
- Env vars from ProjectSettings
- PVC mount] + + CreateJob --> JobCreated{Job created?} + + JobCreated -->|No| UpdateError[Update CR status:
phase=Failed
message=error] + UpdateError --> End + + JobCreated -->|Yes| UpdateRunning[Update CR status:
phase=Running
startTime=now] + UpdateRunning --> StartMonitor[Start goroutine:
monitorJob] + StartMonitor --> End + + MonitorJob --> End + + style Start fill:#e1f5ff + style End fill:#e1ffe1 + style CheckPhase fill:#fff4e1 + style CreateJob fill:#f0e1ff + style UpdateError fill:#ffe1e1 + style UpdateRunning fill:#e1ffe1 +``` + +## Job Monitoring Loop + +```mermaid +sequenceDiagram + participant Op as Operator
(goroutine) + participant K8s as Kubernetes API + participant CR as AgenticSession CR + participant Job as Job + participant Pod as Pod + + Note over Op: Started by
reconciliation loop + + loop Every 5 seconds + Op->>CR: Check if CR still exists + + alt CR deleted + CR-->>Op: IsNotFound error + Note over Op: Exit goroutine
(session deleted by user) + end + + Op->>Job: Get Job status + + alt Job deleted + Job-->>Op: IsNotFound error + Op->>CR: Update status:
phase=Failed
message="Job was deleted" + Note over Op: Exit goroutine + end + + Job-->>Op: Job status + + alt Job succeeded + Op->>CR: Update status:
phase=Completed
completionTime=now + Op->>Job: Delete Job
(cleanup) + Note over Op: Exit goroutine
(success) + + else Job failed + Op->>Pod: Get Pod logs
(last 100 lines) + Pod-->>Op: Error logs + Op->>CR: Update status:
phase=Failed
message=error
results=logs + Op->>Job: Delete Job
(cleanup) + Note over Op: Exit goroutine
(failure) + + else Job still running + Op->>CR: Update status:
progress info + Note over Op: Continue monitoring + end + + Note over Op: Check timeout + alt Timeout exceeded + Op->>Job: Delete Job
(terminate) + Op->>CR: Update status:
phase=Timeout
message="Exceeded timeout" + Note over Op: Exit goroutine
(timeout) + end + end +``` + +## Status Update Patterns + +### Operator Status Updates + +**Use Case:** Operator updates phase transitions + +**Pattern:** Update via `/status` subresource + +```go +// components/operator/internal/handlers/sessions.go +func updateAgenticSessionStatus(namespace, name string, updates map[string]interface{}) error { + gvr := types.GetAgenticSessionResource() + + // Get current CR + obj, err := config.DynamicClient.Resource(gvr). + Namespace(namespace). + Get(ctx, name, v1.GetOptions{}) + + if errors.IsNotFound(err) { + log.Printf("CR deleted, skipping status update") + return nil // Not an error + } + if err != nil { + return err // Any other Get failure: obj is nil, so stop here + } + + // Initialize status if needed + if obj.Object["status"] == nil { + obj.Object["status"] = make(map[string]interface{}) + } + + status := obj.Object["status"].(map[string]interface{}) + for k, v := range updates { + status[k] = v + } + + // Update via /status subresource + _, err = config.DynamicClient.Resource(gvr). + Namespace(namespace). + UpdateStatus(ctx, obj, v1.UpdateOptions{}) + + if errors.IsNotFound(err) { + return nil // CR deleted during update + } + + return err +} +``` + +### Runner Status Updates + +**Use Case:** Runner pod updates results incrementally + +**Pattern:** Runner has minted token with limited permissions + +```python +# components/runners/claude-code-runner/runner.py +def update_session_status(results: Dict[str, Any]): + """Update CR status from runner pod.""" + try: + # Use minted token from Secret + token = os.environ.get("RUNNER_TOKEN") + + # Update via Kubernetes API + response = requests.patch( + f"{k8s_api}/apis/vteam.ambient-code/v1alpha1/namespaces/{namespace}/agenticsessions/{name}/status", + headers={ + "Authorization": f"Bearer {token}", + # Kubernetes rejects PATCH bodies sent as plain application/json + "Content-Type": "application/merge-patch+json", + }, + json={"status": {"results": results}} + ) + + response.raise_for_status() + except Exception as e: + log.error(f"Failed to update status: {e}") + # Non-fatal: operator will update eventually +``` + +## Resource Lifecycle and Cleanup + +```mermaid +graph TD +
subgraph "Resource Creation" + CR[AgenticSession CR
Created by Backend] + Job[Job
Created by Operator] + Pod[Pod
Created by Job Controller] + Secret[Secret
Minted token] + PVC[PVC
Workspace storage] + end + + subgraph "OwnerReferences" + CR -->|controller=true| Job + Job -->|controller=true| Pod + CR -->|controller=true| Secret + end + + subgraph "Cleanup Scenarios" + Delete1[User deletes CR] + Delete2[Job completes
Operator deletes Job] + TTL[TTL expired
K8s deletes CR] + end + + Delete1 --> CascadeDelete1[Kubernetes cascades:
Job → Pod → Secret] + Delete2 --> NormalCleanup[Operator deletes Job
Pod cleaned by Job controller] + TTL --> CascadeDelete2[Same as user delete] + + style CR fill:#ffe1e1 + style Job fill:#fff4e1 + style Pod fill:#e1ffe1 + style Secret fill:#f0e1ff + style Delete1 fill:#ffe1e1 + style Delete2 fill:#e1ffe1 +``` + +**Key Cleanup Principles:** + +1. **OwnerReferences** ensure automatic cleanup when parent is deleted +2. **Controller=true** on primary owner (only one per resource) +3. **No BlockOwnerDeletion** (causes permission issues in multi-tenant) +4. Operator explicitly deletes Jobs on completion (don't wait for cascade) +5. PVCs persist for debugging (manual cleanup or TTL) + +**Reference:** [Backend/Operator Development Standards](../../CLAUDE.md#resource-management) + +--- + +## Error Handling Patterns + +### Non-Fatal Errors (Operator) + +**Scenario:** Resource deleted during processing + +```go +if errors.IsNotFound(err) { + log.Printf("AgenticSession %s no longer exists, skipping", name) + return nil // Not treated as error - user deleted it +} +``` + +### Retriable Errors (Operator) + +**Scenario:** Transient K8s API failure + +```go +if err != nil { + log.Printf("Failed to create Job: %v", err) + updateAgenticSessionStatus(ns, name, map[string]interface{}{ + "phase": "Error", + "message": fmt.Sprintf("Failed to create Job: %v", err), + }) + return fmt.Errorf("failed to create Job: %w", err) + // Operator watch loop will retry on next event +} +``` + +### Terminal Errors (Runner) + +**Scenario:** Invalid API key + +```python +try: + client = anthropic.Anthropic(api_key=api_key) + response = client.messages.create(...) 
+except anthropic.AuthenticationError as e: + # Update CR with terminal error + update_session_status({ + "phase": "Failed", + "message": f"Invalid Anthropic API key: {e}", + "completionTime": datetime.now().isoformat() + }) + sys.exit(1) # Exit pod with failure +``` + +--- + +## Interactive vs Batch Execution + +### Batch Mode (Default) + +**Characteristics:** +- Single prompt execution +- Timeout enforced (default 1 hour) +- Results written to CR on completion +- Pod exits after execution + +**Use Cases:** +- One-off automation tasks +- Scripted workflows +- RFE generation + +**Flow:** +``` +User → Prompt → Runner executes → Results → Pod exits +``` + +--- + +### Interactive Mode + +**Characteristics:** +- Long-running session (no timeout) +- User sends messages via inbox file +- Runner responds via outbox file +- Pod continues running until explicitly stopped + +**Use Cases:** +- Iterative development +- Multi-turn conversations +- Complex debugging sessions + +**Flow:** +``` +User → Initial prompt → Runner starts + ↓ +User writes to inbox → Runner reads → Executes → Writes to outbox + ↓ +User reads outbox → Continues conversation... + ↓ +User signals completion → Pod exits +``` + +**Configuration:** +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: AgenticSession +metadata: + name: interactive-session +spec: + interactive: true # Enable interactive mode + prompt: "Initial prompt" + repos: + - input: + url: https://github.com/org/repo + branch: main +``` + +**File Locations:** +- Inbox: `/workspace/inbox.txt` (user writes) +- Outbox: `/workspace/outbox.txt` (runner writes) +- Workspace: `/workspace/repos/` (cloned repositories) + +--- + +## Multi-Repo Execution + +```mermaid +flowchart LR + subgraph "AgenticSession Spec" + MainIdx[mainRepoIndex: 1] + Repos[repos array:
0: repo-A
1: repo-B
2: repo-C] + end + + subgraph "Runner Workspace" + WS[/workspace/repos/] + RepoA[repo-A/
cloned from repos[0]] + RepoB[repo-B/
cloned from repos[1]
WORKING DIRECTORY] + RepoC[repo-C/
cloned from repos[2]] + end + + subgraph "Status Tracking" + StatusA[repos[0].status:
pushed=true] + StatusB[repos[1].status:
pushed=true] + StatusC[repos[2].status:
abandoned=true] + end + + MainIdx -->|Specifies| RepoB + Repos --> WS + WS --> RepoA + WS --> RepoB + WS --> RepoC + + RepoA -.-> StatusA + RepoB -.-> StatusB + RepoC -.-> StatusC + + style RepoB fill:#e1ffe1 + style MainIdx fill:#fff4e1 +``` + +**Key Concepts:** + +1. **mainRepoIndex** (default: 0): Sets Claude Code working directory +2. **Cloning Order**: Repos cloned in array order +3. **Per-Repo Status**: Each repo tracked individually (pushed/abandoned) +4. **Cross-Repo References**: Claude can access all repos in workspace + +**Reference:** [ADR-0003: Multi-Repository Support](../adr/0003-multi-repo-support.md) + +--- + +## Timeout Handling + +### Timeout Configuration + +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: AgenticSession +spec: + timeout: 3600 # seconds (1 hour) +``` + +**Timeout Sources (priority order):** +1. `spec.timeout` on AgenticSession CR +2. `defaultTimeout` in ProjectSettings CR +3. Global default (1 hour) + +### Timeout Enforcement + +**Operator monitors elapsed time:** + +```go +func monitorJob(jobName, sessionName, namespace string) { + startTime := time.Now() + timeout := getTimeoutForSession(namespace, sessionName) + + for { + time.Sleep(5 * time.Second) + + elapsed := time.Since(startTime) + if elapsed > timeout { + log.Printf("Session %s exceeded timeout (%v)", sessionName, timeout) + + // Terminate Job + deleteJob(namespace, jobName) + + // Update CR status + updateAgenticSessionStatus(namespace, sessionName, map[string]interface{}{ + "phase": "Timeout", + "message": fmt.Sprintf("Exceeded timeout of %v", timeout), + "completionTime": time.Now().Format(time.RFC3339), + }) + + return // Exit monitoring + } + + // ... check Job status ... 
+ } +} +``` + +**Graceful Shutdown:** +- Runner receives SIGTERM from Kubernetes +- Runner captures partial results +- Runner updates CR status before exit + +--- + +## Related Documentation + +- [Core System Architecture](./core-system-architecture.md) - Component overview +- [Kubernetes Resources](./kubernetes-resources.md) - CR schemas +- [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) - Project isolation +- [Operator Development Standards](../../CLAUDE.md#operator-patterns) +- [ADR-0001: Kubernetes-Native Architecture](../adr/0001-kubernetes-native-architecture.md) diff --git a/docs/architecture/core-system-architecture.md b/docs/architecture/core-system-architecture.md new file mode 100644 index 000000000..5ae70e79f --- /dev/null +++ b/docs/architecture/core-system-architecture.md @@ -0,0 +1,402 @@ +# Core System Architecture + +## Overview + +The Ambient Code Platform follows a Kubernetes-native architecture with four primary components that work together to orchestrate AI-powered automation tasks. + +## High-Level Architecture + +```mermaid +graph TB + subgraph "User Interface" + UI[Frontend
NextJS + Shadcn UI
React Query] + end + + subgraph "API Layer" + API[Backend API
Go + Gin
REST + WebSocket] + end + + subgraph "Kubernetes Cluster" + subgraph "Control Plane" + OP[Agentic Operator
Go Controller
Watches CRs] + end + + subgraph "Custom Resources" + AS[AgenticSession
CR] + PS[ProjectSettings
CR] + RFE[RFEWorkflow
CR] + end + + subgraph "Execution" + JOB[Kubernetes Job] + POD[Runner Pod
Python + Claude SDK] + PVC[Persistent Volume
Workspace Storage] + end + end + + UI -->|HTTP/HTTPS
REST API + WS| API + API -->|K8s Dynamic Client
User Token| AS + API -->|K8s Dynamic Client
User Token| PS + API -->|K8s Dynamic Client
User Token| RFE + + OP -->|Watches| AS + OP -->|Watches| PS + OP -->|Watches| RFE + + OP -->|Creates & Monitors| JOB + JOB -->|Spawns| POD + POD -->|Mounts| PVC + + POD -->|Updates Status| AS + OP -->|Updates Status| AS + + AS -.->|OwnerReference| JOB + JOB -.->|OwnerReference| POD + + style UI fill:#e1f5ff + style API fill:#fff4e1 + style OP fill:#f0e1ff + style POD fill:#e1ffe1 + style AS fill:#ffe1e1 + style PS fill:#ffe1e1 + style RFE fill:#ffe1e1 +``` + +## Component Breakdown + +### 1. Frontend (NextJS + Shadcn UI) + +**Technology Stack:** +- NextJS 14+ with App Router +- Shadcn UI component library +- React Query for data fetching +- TypeScript for type safety + +**Responsibilities:** +- User interface for session management +- Real-time status updates via WebSocket +- Project and settings management +- RFE workflow visualization + +**Key Patterns:** +- Server-side rendering for performance +- Optimistic updates with React Query +- Type-safe API client integration + +**Reference:** [Frontend Development Standards](../../CLAUDE.md#frontend-development-standards) + +--- + +### 2. 
Backend API (Go + Gin) + +**Technology Stack:** +- Go 1.21+ +- Gin web framework +- Kubernetes Dynamic Client +- OpenShift OAuth integration + +**Responsibilities:** +- REST API for CRUD operations on Custom Resources +- WebSocket server for real-time updates +- Multi-tenant project isolation (namespace mapping) +- User authentication and authorization (RBAC) +- Git operations (clone, fork, PR creation) + +**Key Endpoints:** +- `/api/projects/:project/agentic-sessions` - Session management +- `/api/projects/:project/project-settings` - Configuration +- `/api/projects/:project/rfe-workflows` - RFE orchestration +- `/ws` - WebSocket for real-time updates + +**Key Patterns:** +- User token authentication for all operations +- Project-scoped endpoints with RBAC validation +- Middleware chain: Recovery → Logging → CORS → Auth → Validation +- Error handling with structured responses + +**Reference:** [Backend Development Standards](../../CLAUDE.md#backend-and-operator-development-standards) + +--- + +### 3. Agentic Operator (Go Controller) + +**Technology Stack:** +- Go 1.21+ +- Kubernetes controller-runtime patterns +- Watch/reconciliation loop +- Custom Resource Definitions (CRDs) + +**Responsibilities:** +- Watch AgenticSession, ProjectSettings, RFEWorkflow CRs +- Reconcile desired state with actual state +- Create and manage Kubernetes Jobs for session execution +- Monitor Job completion and update CR status +- Handle timeouts and cleanup + +**Reconciliation Flow:** +1. Watch for CR events (Added, Modified, Deleted) +2. Check resource phase (Pending, Running, Completed, Failed, Timeout) +3. Create Job if phase is Pending +4. Monitor Job status and update CR +5. 
Handle errors and retries with exponential backoff + +**Key Patterns:** +- Reconnection logic for watch failures +- Idempotent resource creation +- OwnerReferences for automatic cleanup +- Status updates via `/status` subresource +- Goroutine monitoring for long-running jobs + +**Reference:** [Operator Development Standards](../../CLAUDE.md#operator-patterns) + +--- + +### 4. Claude Code Runner (Python) + +**Technology Stack:** +- Python 3.11+ +- Claude Code SDK (≥0.0.23) +- Anthropic API (≥0.68.0) +- Git integration + +**Responsibilities:** +- Execute Claude Code CLI in containerized environment +- Manage workspace synchronization via PVC +- Handle interactive vs. batch execution modes +- Capture results and update CR status +- Multi-agent collaboration coordination + +**Execution Modes:** +- **Batch Mode:** Single prompt execution with timeout +- **Interactive Mode:** Long-running chat using inbox/outbox files + +**Key Patterns:** +- Workspace isolation per session +- Multi-repo support with mainRepoIndex +- Result capture and structured output +- Error propagation to operator + +**Reference:** [Runner Documentation](../../components/runners/claude-code-runner/README.md) + +--- + +## Data Flow: Agentic Session Execution + +```mermaid +sequenceDiagram + actor User + participant UI as Frontend + participant API as Backend API + participant K8s as Kubernetes API + participant Op as Operator + participant Job as Job/Pod + participant CR as AgenticSession CR + + User->>UI: Create Session + UI->>API: POST /api/projects/{project}/agentic-sessions + + Note over API: Extract user token
Validate RBAC permissions + + API->>K8s: Create AgenticSession CR
(using user token) + K8s-->>API: CR Created (UID) + API-->>UI: 201 Created {name, uid} + + Note over Op: Watch loop detects
new CR event + + Op->>K8s: Get AgenticSession CR + K8s-->>Op: CR with phase=Pending + + Op->>K8s: Create Job with OwnerReference + Note over Op: Set controller=true
for automatic cleanup + + K8s-->>Op: Job Created + Op->>K8s: Update CR status
phase=Running + + K8s->>Job: Schedule Pod + + Note over Job: Runner executes
Claude Code CLI + + loop Monitoring + Op->>K8s: Check Job status + K8s-->>Op: Job status (running/succeeded/failed) + + Op->>K8s: Update CR status
(progress, logs, errors) + end + + Job->>K8s: Update CR status
(results, completionTime) + + Op->>K8s: Update CR status
phase=Completed + + K8s-->>API: Status change event + API-->>UI: WebSocket update + UI-->>User: Display results +``` + +## Multi-Tenancy Model + +```mermaid +graph LR + subgraph "Project A" + PA[Project 'team-alpha'] + NSA[Namespace: team-alpha] + ASA1[AgenticSession-1] + ASA2[AgenticSession-2] + PSA[ProjectSettings] + end + + subgraph "Project B" + PB[Project 'team-beta'] + NSB[Namespace: team-beta] + ASB1[AgenticSession-1] + PSB[ProjectSettings] + end + + PA -->|Maps to| NSA + PB -->|Maps to| NSB + + NSA -->|Contains| ASA1 + NSA -->|Contains| ASA2 + NSA -->|Contains| PSA + + NSB -->|Contains| ASB1 + NSB -->|Contains| PSB + + style PA fill:#e1f5ff + style PB fill:#ffe1e1 + style NSA fill:#e1f5ff + style NSB fill:#ffe1e1 +``` + +**Isolation Guarantees:** +- Each project maps to a dedicated Kubernetes namespace (1:1 mapping) +- User tokens enforce RBAC at namespace boundaries +- Resources cannot cross namespace boundaries +- Backend validates project access before CR operations + +**Reference:** [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) + +--- + +## Key Architectural Decisions + +### 1. Kubernetes-Native Design + +**Why:** Leverage Kubernetes for orchestration, scheduling, resource management, and RBAC. + +**Benefits:** +- Declarative resource model via Custom Resources +- Built-in RBAC and multi-tenancy +- Horizontal scalability +- Self-healing and automatic cleanup via OwnerReferences + +**Reference:** [ADR-0001: Kubernetes-Native Architecture](../adr/0001-kubernetes-native-architecture.md) + +--- + +### 2. User Token Authentication + +**Why:** Enforce per-user RBAC for all API operations instead of using elevated service account permissions. 
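A minimal sketch of the per-user token handling this decision implies, written in Python for brevity even though the backend itself is Go. The helper names `extract_bearer_token` and `kubernetes_headers` and the `MissingTokenError` type are hypothetical and do not come from the platform's code. The sketch only illustrates that the caller's own token, not a service account token, is forwarded to the Kubernetes API so that RBAC is evaluated per user.

```python
from typing import Dict, Mapping


class MissingTokenError(Exception):
    """Raised when a request carries no usable bearer token (hypothetical type)."""


def extract_bearer_token(headers: Mapping[str, str]) -> str:
    """Return the token from an 'Authorization: Bearer <token>' header.

    Header names are matched case-insensitively; any other scheme
    (for example Basic) is treated as if no token were present.
    """
    for key, value in headers.items():
        if key.lower() != "authorization":
            continue
        scheme, _, token = value.strip().partition(" ")
        if scheme.lower() == "bearer" and token.strip():
            return token.strip()
    raise MissingTokenError("request has no bearer token")


def kubernetes_headers(user_token: str) -> Dict[str, str]:
    """Build headers for a Kubernetes API call made on the user's behalf.

    Assumption: the same token the user presented is forwarded unchanged,
    so the API server applies that user's RBAC rules.
    """
    return {"Authorization": f"Bearer {user_token}"}


if __name__ == "__main__":
    incoming = {"Authorization": "Bearer example-user-token"}
    token = extract_bearer_token(incoming)
    print(kubernetes_headers(token))
```

A request without a bearer token fails before any Kubernetes call is attempted, which keeps the elevated service account out of the ordinary request path.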
+ +**Pattern:** +- Frontend extracts user token from OAuth flow +- Backend validates token and uses it for K8s API calls +- Service account only for CR writes and token minting + +**Security Benefits:** +- Audit trail per user +- Least-privilege access +- No privilege escalation risks + +**Reference:** [ADR-0002: User Token Authentication](../adr/0002-user-token-authentication.md) + +--- + +### 3. Asynchronous Execution Model + +**Why:** Long-running AI tasks cannot block HTTP requests. + +**Pattern:** +- **Synchronous:** User request → Backend creates CR → Return immediately +- **Asynchronous:** Operator watches → Creates Job → Monitors → Updates status +- **Feedback:** WebSocket or polling for status updates + +**Benefits:** +- Responsive UI (no hanging requests) +- Resilient to operator/pod restarts +- Kubernetes handles scheduling and retries + +--- + +### 4. Go Backend + Python Runner + +**Why:** Use the best tool for each layer. + +**Rationale:** +- **Go for Backend/Operator:** Performance, K8s client libraries, concurrency +- **Python for Runner:** Claude SDK, rich AI/ML ecosystem, rapid development + +**Reference:** [ADR-0004: Go Backend + Python Runner](../adr/0004-go-backend-python-runner.md) + +--- + +## Component Communication Matrix + +| Source | Target | Protocol | Auth | Purpose | +|--------|--------|----------|------|---------| +| Frontend | Backend API | HTTPS (REST) | OAuth Token | CRUD operations | +| Frontend | Backend API | WebSocket | OAuth Token | Real-time updates | +| Backend API | Kubernetes API | K8s Dynamic Client | User Token | CR operations | +| Operator | Kubernetes API | K8s Dynamic Client | Service Account | Watch CRs, manage Jobs | +| Runner Pod | Kubernetes API | K8s Dynamic Client | Pod SA + Minted Token | Update CR status | +| Operator | Runner Job | - | OwnerReference | Lifecycle management | + +--- + +## Scalability Considerations + +### Horizontal Scaling + +**Frontend:** +- Stateless NextJS instances +- Scale with Kubernetes 
Deployment replicas +- Load balancing via Ingress/Route + +**Backend API:** +- Stateless Go instances +- Scale with Kubernetes Deployment replicas +- WebSocket sessions require session affinity (sticky sessions) + +**Operator:** +- Single-replica controller (leader election for HA) +- Watch multiple namespaces concurrently +- Goroutine per Job for monitoring + +**Runner Pods:** +- One Pod per AgenticSession (isolation) +- Kubernetes handles scheduling across nodes +- Resource limits prevent resource exhaustion + +### Resource Limits + +```yaml +# Example resource configuration +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2Gi" + cpu: "1000m" +``` + +**Reference:** [Production Considerations](../../CLAUDE.md#production-considerations) + +--- + +## Related Documentation + +- [Agentic Session Lifecycle](./agentic-session-lifecycle.md) - State machine and reconciliation flow +- [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) - Project isolation and RBAC +- [Kubernetes Resources](./kubernetes-resources.md) - CRD structures and schemas +- [Backend Development Standards](../../CLAUDE.md#backend-and-operator-development-standards) +- [Frontend Development Standards](../../components/frontend/DESIGN_GUIDELINES.md) diff --git a/docs/architecture/index.md b/docs/architecture/index.md new file mode 100644 index 000000000..2a1da12b5 --- /dev/null +++ b/docs/architecture/index.md @@ -0,0 +1,344 @@ +# Architecture Overview + +Welcome to the **Ambient Code Platform Architecture Documentation**. This section provides comprehensive visual diagrams and detailed explanations of the platform's design, components, and patterns. 
+ +## Purpose + +This architecture documentation helps you: + +- **Understand** the platform's component interactions and data flows +- **Navigate** complex distributed systems with clear visual aids +- **Make informed decisions** when extending or modifying the platform +- **Onboard quickly** with structured visual learning + +## Navigation Guide + +### Core Architecture + +Start here to understand the foundational platform architecture: + +| Document | Description | Key Diagrams | +|----------|-------------|--------------| +| **[Core System Architecture](./core-system-architecture.md)** | 4-component system overview, data flows, and component responsibilities | System architecture, sequence diagrams, multi-tenancy model | +| **[Agentic Session Lifecycle](./agentic-session-lifecycle.md)** | Session state machine, operator reconciliation, and execution patterns | State diagram, reconciliation flowchart, monitoring loop | +| **[Multi-Tenancy Architecture](./multi-tenancy-architecture.md)** | Project isolation, RBAC enforcement, and security boundaries | Namespace mapping, authentication flow, permission matrix | +| **[Kubernetes Resources](./kubernetes-resources.md)** | Custom Resource Definitions (CRDs), schemas, and resource relationships | CR hierarchy, class diagrams, cleanup strategies | + +--- + +## Quick Start by Role + +### For Developers + +**Start here if you're:** +- Adding new features to the backend or frontend +- Debugging session execution issues +- Understanding component interactions + +**Recommended Reading Order:** +1. [Core System Architecture](./core-system-architecture.md) - Get the big picture +2. [Agentic Session Lifecycle](./agentic-session-lifecycle.md) - Understand execution flow +3. 
[Kubernetes Resources](./kubernetes-resources.md) - Learn CR structures + +--- + +### For Platform Engineers + +**Start here if you're:** +- Deploying the platform to production +- Setting up multi-tenancy and RBAC +- Troubleshooting operator issues + +**Recommended Reading Order:** +1. [Core System Architecture](./core-system-architecture.md) - Component overview +2. [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) - Isolation and security +3. [Agentic Session Lifecycle](./agentic-session-lifecycle.md) - Operator patterns + +--- + +### For Architects + +**Start here if you're:** +- Evaluating the platform for adoption +- Planning integrations or extensions +- Understanding architectural decisions + +**Recommended Reading Order:** +1. [Core System Architecture](./core-system-architecture.md) - Full system design +2. Review [Architecture Decision Records](../adr/) - Understand "why" behind decisions +3. [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) - Security model +4. [Kubernetes Resources](./kubernetes-resources.md) - Resource model + +--- + +## Architectural Principles + +The Ambient Code Platform is built on these core principles: + +### 1. Kubernetes-Native Design + +**Why:** Leverage Kubernetes for orchestration, scheduling, and resource management. + +**How:** +- Custom Resource Definitions (CRDs) for declarative state +- Operator pattern for reconciliation +- Built-in RBAC for multi-tenancy +- OwnerReferences for automatic cleanup + +**Reference:** [ADR-0001: Kubernetes-Native Architecture](../adr/0001-kubernetes-native-architecture.md) + +--- + +### 2. User Token Authentication + +**Why:** Enforce per-user RBAC instead of using elevated service account permissions. 
+ +**How:** +- Frontend extracts OAuth token +- Backend validates and uses token for K8s API calls +- Service account only for specific elevated operations (CR writes, token minting) + +**Reference:** [ADR-0002: User Token Authentication](../adr/0002-user-token-authentication.md) + +--- + +### 3. Asynchronous Execution + +**Why:** Long-running AI tasks cannot block HTTP requests. + +**How:** +- Synchronous: User request → Backend creates CR → Return immediately +- Asynchronous: Operator watches → Creates Job → Monitors → Updates status +- Feedback: WebSocket or polling for status updates + +**Benefits:** +- Responsive UI +- Resilient to restarts +- Kubernetes handles scheduling + +--- + +### 4. Multi-Repository Support + +**Why:** Real-world automation often requires changes across multiple codebases. + +**How:** +- Sessions can reference multiple Git repositories +- `mainRepoIndex` specifies working directory +- Per-repo status tracking (pushed, abandoned, PR URL) + +**Reference:** [ADR-0003: Multi-Repository Support](../adr/0003-multi-repo-support.md) + +--- + +### 5. Polyglot Architecture + +**Why:** Use the best language for each layer. 
+ +**How:** +- **Go** for backend/operator: Performance, K8s libraries, concurrency +- **Python** for runner: Claude SDK, AI/ML ecosystem, rapid development +- **TypeScript/NextJS** for frontend: Modern web development, type safety + +**Reference:** [ADR-0004: Go Backend + Python Runner](../adr/0004-go-backend-python-runner.md) + +--- + +## System Components + +### Frontend (NextJS + Shadcn UI) + +**Purpose:** Web UI for session management and monitoring + +**Technology:** +- NextJS 14+ with App Router +- Shadcn UI component library +- React Query for data fetching +- TypeScript for type safety + +**Reference:** [Frontend Development Standards](../../components/frontend/DESIGN_GUIDELINES.md) + +--- + +### Backend API (Go + Gin) + +**Purpose:** REST API for CRUD operations on Custom Resources + +**Technology:** +- Go 1.21+ +- Gin web framework +- Kubernetes Dynamic Client +- OpenShift OAuth integration + +**Key Endpoints:** +- `/api/projects/:project/agentic-sessions` - Session management +- `/api/projects/:project/project-settings` - Configuration +- `/api/projects/:project/rfe-workflows` - RFE orchestration +- `/ws` - WebSocket for real-time updates + +**Reference:** [Backend Development Standards](../../CLAUDE.md#backend-and-operator-development-standards) + +--- + +### Agentic Operator (Go Controller) + +**Purpose:** Watch Custom Resources and reconcile state + +**Technology:** +- Go 1.21+ +- Kubernetes controller-runtime patterns +- Watch/reconciliation loop + +**Responsibilities:** +- Watch AgenticSession, ProjectSettings, RFEWorkflow CRs +- Create and manage Kubernetes Jobs +- Monitor Job completion and update CR status +- Handle timeouts and cleanup + +**Reference:** [Operator Development Standards](../../CLAUDE.md#operator-patterns) + +--- + +### Claude Code Runner (Python) + +**Purpose:** Execute Claude Code CLI in containerized environment + +**Technology:** +- Python 3.11+ +- Claude Code SDK (≥0.0.23) +- Anthropic API (≥0.68.0) +- Git integration + 
+**Responsibilities:** +- Execute AI-powered automation tasks +- Manage workspace synchronization +- Capture results and update CR status +- Handle interactive and batch modes + +**Reference:** [Runner Documentation](../../components/runners/claude-code-runner/README.md) + +--- + +## Data Flow Summary + +```mermaid +graph LR + User[User] -->|HTTPS| FE[Frontend] + FE -->|REST API| BE[Backend API] + BE -->|K8s Dynamic Client| K8s[Kubernetes API] + + K8s -->|CR Created| OP[Operator] + OP -->|Creates Job| JOB[Job] + JOB -->|Spawns Pod| POD[Runner Pod] + + POD -->|Updates Status| K8s + K8s -->|Status Change| BE + BE -->|WebSocket| FE + FE -->|Display| User + + style User fill:#e1f5ff + style FE fill:#fff4e1 + style BE fill:#ffe1e1 + style K8s fill:#f0e1ff + style OP fill:#e1ffe1 + style POD fill:#ffe1e1 +``` + +**High-Level Flow:** + +1. **User** interacts with **Frontend** UI +2. **Frontend** sends API request to **Backend** +3. **Backend** creates Custom Resource via **Kubernetes API** (using user token) +4. **Operator** detects CR and creates **Job** +5. **Job** spawns **Runner Pod** to execute task +6. **Runner** updates CR status with results +7. **Backend** sends WebSocket update to **Frontend** +8. **Frontend** displays results to **User** + +**Reference:** [Core System Architecture - Data Flow](./core-system-architecture.md#data-flow-agentic-session-execution) + +--- + +## Architecture Decision Records (ADRs) + +ADRs document **why** architectural decisions were made, not just **what** was implemented. 
+ +| ADR | Title | Date | Status | +|-----|-------|------|--------| +| [0001](../adr/0001-kubernetes-native-architecture.md) | Kubernetes-Native Architecture | 2024-11 | Accepted | +| [0002](../adr/0002-user-token-authentication.md) | User Token Authentication for API Operations | 2024-11 | Accepted | +| [0003](../adr/0003-multi-repo-support.md) | Multi-Repository Support in AgenticSessions | 2024-11 | Accepted | +| [0004](../adr/0004-go-backend-python-runner.md) | Go Backend + Python Runner Technology Stack | 2024-11 | Accepted | +| [0005](../adr/0005-nextjs-shadcn-react-query.md) | NextJS + Shadcn + React Query Frontend Stack | 2024-11 | Accepted | + +**See also:** [Decision Log](../decisions.md) for chronological record of all major decisions. + +--- + +## Design Documents + +Detailed design documents for specific features: + +| Document | Description | +|----------|-------------| +| [Declarative Session Reconciliation](../design/declarative-session-reconciliation.md) | Operator reconciliation patterns | +| [Session Initialization Flows](../design/session-initialization-flows.md) | Session creation and startup | +| [Session Status Redesign](../design/session-status-redesign.md) | Status tracking and reporting | +| [Runner-Operator Contracts](../design/runner-operator-contracts.md) | Communication between runner and operator | + +--- + +## Related Context Files + +Loadable context files for specific development tasks: + +| Context File | Use When | +|--------------|----------| +| [Backend Development](../../.claude/context/backend-development.md) | Working on Go backend or operator | +| [Frontend Development](../../.claude/context/frontend-development.md) | Working on NextJS frontend | +| [Security Standards](../../.claude/context/security-standards.md) | Reviewing security practices | + +**Reference:** [Repomix Usage Guide](../../.claude/repomix-guide.md) for using architecture views. 
+ +--- + +## Code Pattern Catalog + +Common patterns used throughout the codebase: + +| Pattern File | Description | +|--------------|-------------| +| [Error Handling](../../.claude/patterns/error-handling.md) | Consistent error patterns (backend, operator, runner) | +| [K8s Client Usage](../../.claude/patterns/k8s-client-usage.md) | When to use user token vs. service account | +| [React Query Usage](../../.claude/patterns/react-query-usage.md) | Data fetching patterns (queries, mutations, caching) | + +--- + +## Contributing to Architecture Docs + +When adding or updating architecture documentation: + +1. **Use Mermaid diagrams** for visualizations (compatible with MkDocs and GitHub) +2. **Follow established patterns** (see existing architecture docs for examples) +3. **Link to related documentation** (ADRs, design docs, code patterns) +4. **Update this index** when adding new architecture pages +5. **Test diagrams** at [mermaid.live](https://mermaid.live) before committing + +**Diagram Format Examples:** +- System architecture → `graph TB` or `graph LR` +- State transitions → `stateDiagram-v2` +- Workflows → `sequenceDiagram` +- Class structures → `classDiagram` +- Flows → `flowchart` + +--- + +## Questions or Feedback? + +For questions about the architecture: + +- **Technical questions:** See [Developer Guide](../developer/index.md) +- **Architecture proposals:** Create an issue with the `architecture` label +- **Corrections:** Submit a PR with proposed changes + +**Repository:** [https://github.com/ambient-code/platform](https://github.com/ambient-code/platform) diff --git a/docs/architecture/kubernetes-resources.md b/docs/architecture/kubernetes-resources.md new file mode 100644 index 000000000..c9f1f924f --- /dev/null +++ b/docs/architecture/kubernetes-resources.md @@ -0,0 +1,1042 @@ +# Kubernetes Custom Resources + +## Overview + +The Ambient Code Platform uses Kubernetes Custom Resource Definitions (CRDs) to represent AI automation tasks and configuration. 
This document details the structure, lifecycle, and relationships of the three primary CRDs. + +## Custom Resource Hierarchy + +```mermaid +graph TB + subgraph "Namespace: team-alpha" + PS[ProjectSettings
settings
API keys, defaults] + + AS1[AgenticSession
session-1
Batch mode] + AS2[AgenticSession
session-2
Interactive mode] + + RFE1[RFEWorkflow
rfe-auth-feature
7-step council] + + Job1[Job
session-1-runner] + Job2[Job
session-2-runner] + + Pod1[Pod
session-1-runner-xyz] + Pod2[Pod
session-2-runner-abc] + + Secret1[Secret
runner-token-session-1] + Secret2[Secret
runner-token-session-2] + + PVC1[PVC
workspace-session-1] + PVC2[PVC
workspace-session-2] + end + + PS -.->|Referenced by| AS1 + PS -.->|Referenced by| AS2 + PS -.->|Referenced by| RFE1 + + AS1 -->|OwnerReference
controller=true| Job1 + AS1 -->|OwnerReference
controller=true| Secret1 + + AS2 -->|OwnerReference
controller=true| Job2 + AS2 -->|OwnerReference
controller=true| Secret2 + + Job1 -->|OwnerReference
controller=true| Pod1 + Job2 -->|OwnerReference
controller=true| Pod2 + + Pod1 -.->|Mounts| PVC1 + Pod2 -.->|Mounts| PVC2 + + style PS fill:#ffe1e1 + style AS1 fill:#e1f5ff + style AS2 fill:#e1f5ff + style RFE1 fill:#fff4e1 + style Job1 fill:#f0e1ff + style Job2 fill:#f0e1ff +``` + +**Legend:** +- Solid arrows (→): OwnerReference (parent → child) +- Dashed arrows (-.->): Reference or mount (not ownership) + +--- + +## AgenticSession Custom Resource + +### Purpose + +Represents a single AI-powered automation task executed via Claude Code. + +### API Definition + +**Group:** `vteam.ambient-code` +**Version:** `v1alpha1` +**Kind:** `AgenticSession` +**Plural:** `agenticsessions` +**Shortname:** `as` + +### Resource Structure + +```mermaid +classDiagram + class AgenticSession { + +metadata ObjectMeta + +spec AgenticSessionSpec + +status AgenticSessionStatus + } + + class AgenticSessionSpec { + +prompt string + +repos []RepoConfig + +mainRepoIndex int + +interactive bool + +timeout int + +model string + +anthropicApiKeySecret string + } + + class RepoConfig { + +input RepoInput + +output RepoOutput + } + + class RepoInput { + +url string + +branch string + +authSecret string + } + + class RepoOutput { + +forkRepo string + +targetBranch string + +createPR bool + } + + class AgenticSessionStatus { + +phase string + +startTime string + +completionTime string + +results string + +message string + +repos []RepoStatus + } + + class RepoStatus { + +index int + +pushed bool + +prUrl string + +error string + } + + AgenticSession --> AgenticSessionSpec + AgenticSession --> AgenticSessionStatus + AgenticSessionSpec --> RepoConfig + RepoConfig --> RepoInput + RepoConfig --> RepoOutput + AgenticSessionStatus --> RepoStatus +``` + +### Spec Fields + +#### `spec.prompt` (required) + +**Type:** `string` + +**Description:** The instruction or task for Claude Code to execute. 
+ +**Examples:** +```yaml +prompt: "Add unit tests for the authentication module" +``` + +```yaml +prompt: "Refactor the database connection logic to use connection pooling" +``` + +--- + +#### `spec.repos` (required) + +**Type:** `[]RepoConfig` + +**Description:** Array of Git repositories to operate on. At least one repo required. + +**Structure:** + +```yaml +repos: + - input: + url: "https://github.com/org/backend" + branch: "main" + authSecret: "git-credentials" # optional + output: + forkRepo: "https://github.com/user/backend" # optional + targetBranch: "feature/auth-refactor" # optional + createPR: true # optional +``` + +**Fields:** + +- **`input.url`** (required): Git repository URL (HTTPS or SSH) +- **`input.branch`** (required): Branch to clone and work on +- **`input.authSecret`** (optional): Secret name containing Git credentials +- **`output.forkRepo`** (optional): Fork repository URL for pushing changes +- **`output.targetBranch`** (optional): Target branch for PR creation +- **`output.createPR`** (optional): Whether to create PR after pushing + +**Reference:** [ADR-0003: Multi-Repository Support](../adr/0003-multi-repo-support.md) + +--- + +#### `spec.mainRepoIndex` (optional) + +**Type:** `int` + +**Description:** Index of the repository to use as Claude Code's working directory. + +**Default:** `0` (first repository) + +**Example:** + +```yaml +repos: + - input: + url: "https://github.com/org/shared-lib" + branch: "main" + - input: + url: "https://github.com/org/api-service" + branch: "develop" +mainRepoIndex: 1 # Work in api-service repo +``` + +--- + +#### `spec.interactive` (optional) + +**Type:** `bool` + +**Description:** Enable interactive mode for multi-turn conversations. 
+ +**Default:** `false` (batch mode) + +**Interactive Mode:** +- Pod continues running after initial execution +- User sends messages via inbox file (`/workspace/inbox.txt`) +- Runner responds via outbox file (`/workspace/outbox.txt`) +- No timeout enforced + +**Example:** + +```yaml +interactive: true +prompt: "Help me debug the authentication flow" +``` + +--- + +#### `spec.timeout` (optional) + +**Type:** `int` + +**Description:** Timeout in seconds for batch mode execution. + +**Default:** Uses ProjectSettings default or 3600 (1 hour) + +**Ignored in interactive mode** + +**Example:** + +```yaml +timeout: 7200 # 2 hours +``` + +--- + +#### `spec.model` (optional) + +**Type:** `string` + +**Description:** Claude model to use for execution. + +**Default:** Uses ProjectSettings default or `claude-sonnet-4-5` + +**Valid Values:** +- `claude-opus-4-5` +- `claude-sonnet-4-5` +- `claude-haiku-4` + +**Example:** + +```yaml +model: "claude-opus-4-5" # Use most capable model +``` + +--- + +#### `spec.anthropicApiKeySecret` (optional) + +**Type:** `string` + +**Description:** Secret name containing Anthropic API key. + +**Default:** Uses ProjectSettings default + +**Secret Format:** + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: anthropic-api-key +type: Opaque +stringData: + ANTHROPIC_API_KEY: sk-ant-... +``` + +--- + +### Status Fields + +#### `status.phase` (set by operator) + +**Type:** `string` + +**Description:** Current phase of session execution. + +**Valid Values:** +- `Pending` - CR created, waiting for operator to create Job +- `Running` - Job created, pod executing +- `Completed` - Execution succeeded +- `Failed` - Execution failed +- `Timeout` - Execution exceeded timeout + +**Reference:** [Agentic Session Lifecycle](./agentic-session-lifecycle.md) + +--- + +#### `status.startTime` (set by operator) + +**Type:** `string` (RFC3339 timestamp) + +**Description:** When execution started (Job created). 
+ +**Example:** `"2025-12-08T14:30:00Z"` + +--- + +#### `status.completionTime` (set by operator/runner) + +**Type:** `string` (RFC3339 timestamp) + +**Description:** When execution completed (success, failure, or timeout). + +**Example:** `"2025-12-08T15:45:00Z"` + +--- + +#### `status.results` (set by runner) + +**Type:** `string` + +**Description:** Execution results, logs, or output from Claude Code. + +**May contain:** +- Generated code snippets +- File paths modified +- Test results +- Error messages +- Partial results (if timeout/failure) + +--- + +#### `status.message` (set by operator/runner) + +**Type:** `string` + +**Description:** Human-readable status message (especially for errors). + +**Examples:** +- `"Execution completed successfully"` +- `"Failed to authenticate with Anthropic API"` +- `"Exceeded timeout of 3600 seconds"` +- `"Git repository not found"` + +--- + +#### `status.repos` (set by runner) + +**Type:** `[]RepoStatus` + +**Description:** Per-repository status tracking. + +**Structure:** + +```yaml +status: + repos: + - index: 0 + pushed: true + prUrl: "https://github.com/org/backend/pull/123" + - index: 1 + pushed: false + error: "No changes to push" +``` + +**Fields:** + +- **`index`**: Corresponds to `spec.repos[index]` +- **`pushed`**: Whether changes were pushed to remote +- **`prUrl`**: Pull request URL (if created) +- **`error`**: Error message (if push/PR creation failed) + +--- + +### Complete Example + +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: AgenticSession +metadata: + name: add-auth-tests + namespace: team-alpha + labels: + project: backend-api + type: testing +spec: + prompt: | + Add comprehensive unit tests for the authentication module. 
+ Ensure coverage of: + - Login/logout flows + - Token validation + - Password reset + - Edge cases (expired tokens, invalid credentials) + + repos: + - input: + url: "https://github.com/org/backend-api" + branch: "develop" + authSecret: "github-pat" + output: + forkRepo: "https://github.com/user/backend-api" + targetBranch: "feature/auth-tests" + createPR: true + + mainRepoIndex: 0 + interactive: false + timeout: 3600 + model: "claude-sonnet-4-5" + anthropicApiKeySecret: "anthropic-api-key" + +status: + phase: "Completed" + startTime: "2025-12-08T14:30:00Z" + completionTime: "2025-12-08T14:52:30Z" + results: | + Successfully added unit tests: + - tests/auth/test_login.py (12 tests) + - tests/auth/test_token_validation.py (8 tests) + - tests/auth/test_password_reset.py (6 tests) + + Coverage increased from 68% to 89% for auth module. + + message: "Execution completed successfully" + + repos: + - index: 0 + pushed: true + prUrl: "https://github.com/org/backend-api/pull/456" +``` + +--- + +## ProjectSettings Custom Resource + +### Purpose + +Stores project-wide configuration such as default models, API keys, and timeout settings. + +### API Definition + +**Group:** `vteam.ambient-code` +**Version:** `v1alpha1` +**Kind:** `ProjectSettings` +**Plural:** `projectsettings` +**Shortname:** `ps` + +### Resource Structure + +```mermaid +classDiagram + class ProjectSettings { + +metadata ObjectMeta + +spec ProjectSettingsSpec + } + + class ProjectSettingsSpec { + +defaultModel string + +defaultTimeout int + +anthropicApiKeySecret string + +gitCredentialsSecret string + +enableAutoCleanup bool + +retentionDays int + } + + ProjectSettings --> ProjectSettingsSpec +``` + +### Spec Fields + +#### `spec.defaultModel` (optional) + +**Type:** `string` + +**Description:** Default Claude model for sessions without explicit `model` field. 
+ +**Default:** `claude-sonnet-4-5` + +**Example:** + +```yaml +defaultModel: "claude-opus-4-5" # Use most capable model by default +``` + +--- + +#### `spec.defaultTimeout` (optional) + +**Type:** `int` + +**Description:** Default timeout (seconds) for batch mode sessions. + +**Default:** `3600` (1 hour) + +**Example:** + +```yaml +defaultTimeout: 7200 # 2 hours for complex tasks +``` + +--- + +#### `spec.anthropicApiKeySecret` (optional) + +**Type:** `string` + +**Description:** Default Secret name for Anthropic API key. + +**Sessions without explicit `anthropicApiKeySecret` use this default.** + +**Example:** + +```yaml +anthropicApiKeySecret: "anthropic-api-key" +``` + +--- + +#### `spec.gitCredentialsSecret` (optional) + +**Type:** `string` + +**Description:** Default Secret name for Git authentication. + +**Sessions without explicit `authSecret` in repo config use this default.** + +**Example:** + +```yaml +gitCredentialsSecret: "github-pat" +``` + +--- + +#### `spec.enableAutoCleanup` (optional) + +**Type:** `bool` + +**Description:** Enable automatic cleanup of completed sessions. + +**Default:** `false` + +**Example:** + +```yaml +enableAutoCleanup: true +retentionDays: 7 # Delete completed sessions after 7 days +``` + +--- + +#### `spec.retentionDays` (optional) + +**Type:** `int` + +**Description:** Days to retain completed sessions before auto-cleanup. + +**Default:** `7` + +**Only applies if `enableAutoCleanup: true`** + +--- + +### Complete Example + +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: ProjectSettings +metadata: + name: settings + namespace: team-alpha +spec: + defaultModel: "claude-sonnet-4-5" + defaultTimeout: 5400 # 90 minutes + anthropicApiKeySecret: "anthropic-api-key" + gitCredentialsSecret: "github-pat" + enableAutoCleanup: true + retentionDays: 14 +``` + +--- + +## RFEWorkflow Custom Resource + +### Purpose + +Orchestrates a 7-step agent council process for Request For Enhancement (RFE) refinement. 
+ +### API Definition + +**Group:** `vteam.ambient-code` +**Version:** `v1alpha1` +**Kind:** `RFEWorkflow` +**Plural:** `rfeworkflows` +**Shortname:** `rfe` + +### Resource Structure + +```mermaid +classDiagram + class RFEWorkflow { + +metadata ObjectMeta + +spec RFEWorkflowSpec + +status RFEWorkflowStatus + } + + class RFEWorkflowSpec { + +request string + +context string + +repos []RepoConfig + +stepTimeout int + } + + class RFEWorkflowStatus { + +phase string + +currentStep int + +steps []StepStatus + +finalRFE string + +startTime string + +completionTime string + } + + class StepStatus { + +stepNumber int + +agent string + +status string + +output string + +startTime string + +completionTime string + } + + RFEWorkflow --> RFEWorkflowSpec + RFEWorkflow --> RFEWorkflowStatus + RFEWorkflowStatus --> StepStatus +``` + +### 7-Step Agent Council + +```mermaid +flowchart LR + Request[User Request] --> Step1 + + Step1[Step 1:
Product Manager
Requirements clarification] --> Step2 + Step2[Step 2:
Solution Architect
Technical design] --> Step3 + Step3[Step 3:
Staff Engineer
Implementation plan] --> Step4 + Step4[Step 4:
Product Owner
Acceptance criteria] --> Step5 + Step5[Step 5:
Team Lead
Task breakdown] --> Step6 + Step6[Step 6:
Team Member
Effort estimation] --> Step7 + Step7[Step 7:
Delivery Owner
Risk assessment] --> Final + + Final[Final RFE Document] + + style Request fill:#e1f5ff + style Final fill:#e1ffe1 + style Step1 fill:#ffe1e1 + style Step2 fill:#fff4e1 + style Step3 fill:#f0e1ff + style Step4 fill:#ffe1e1 + style Step5 fill:#fff4e1 + style Step6 fill:#f0e1ff + style Step7 fill:#ffe1e1 +``` + +**Agent Roles:** + +1. **Product Manager:** Clarifies requirements, defines user stories +2. **Solution Architect:** Designs technical architecture, identifies dependencies +3. **Staff Engineer:** Creates implementation plan, reviews code patterns +4. **Product Owner:** Defines acceptance criteria and success metrics +5. **Team Lead:** Breaks down into tasks, assigns priorities +6. **Team Member:** Estimates effort, identifies blockers +7. **Delivery Owner:** Assesses risks, creates rollback plan + +--- + +### Spec Fields + +#### `spec.request` (required) + +**Type:** `string` + +**Description:** Initial RFE request or feature description. + +**Example:** + +```yaml +request: | + Add support for OAuth2 authentication in the API. + Users should be able to authenticate using Google, GitHub, and Microsoft accounts. +``` + +--- + +#### `spec.context` (optional) + +**Type:** `string` + +**Description:** Additional context for the council (codebase state, constraints, preferences). + +**Example:** + +```yaml +context: | + - Existing authentication uses JWT tokens + - Frontend is React-based + - Backend is Go + Gin framework + - Prefer minimal dependencies +``` + +--- + +#### `spec.repos` (required) + +**Type:** `[]RepoConfig` + +**Description:** Repositories for council to analyze (same structure as AgenticSession). + +--- + +#### `spec.stepTimeout` (optional) + +**Type:** `int` + +**Description:** Timeout (seconds) per step. 
+ +**Default:** `1800` (30 minutes) + +--- + +### Status Fields + +#### `status.phase` (set by operator) + +**Type:** `string` + +**Valid Values:** +- `Pending` - Workflow created, not started +- `Running` - Executing steps +- `Completed` - All steps completed +- `Failed` - One or more steps failed + +--- + +#### `status.currentStep` (set by operator) + +**Type:** `int` + +**Description:** Currently executing step (1-7). + +--- + +#### `status.steps` (set by operator/runner) + +**Type:** `[]StepStatus` + +**Description:** Status for each of the 7 steps. + +**Fields:** + +- **`stepNumber`**: 1-7 +- **`agent`**: Agent role (e.g., "Product Manager") +- **`status`**: `Pending`, `Running`, `Completed`, `Failed` +- **`output`**: Agent's output for this step +- **`startTime`**: RFC3339 timestamp +- **`completionTime`**: RFC3339 timestamp + +--- + +#### `status.finalRFE` (set by runner) + +**Type:** `string` + +**Description:** Final synthesized RFE document combining all agent outputs. + +--- + +### Complete Example + +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: RFEWorkflow +metadata: + name: oauth-authentication + namespace: team-alpha +spec: + request: | + Add OAuth2 authentication to the API supporting Google, GitHub, and Microsoft. 
+ + context: | + - Current auth uses JWT tokens + - Backend: Go + Gin + - Frontend: React + NextJS + + repos: + - input: + url: "https://github.com/org/backend-api" + branch: "develop" + - input: + url: "https://github.com/org/frontend" + branch: "develop" + + stepTimeout: 1800 + +status: + phase: "Completed" + currentStep: 7 + + steps: + - stepNumber: 1 + agent: "Product Manager" + status: "Completed" + output: | + Requirements clarified: + - Support 3 OAuth providers + - Fallback to JWT for API clients + - User profile sync on first login + startTime: "2025-12-08T10:00:00Z" + completionTime: "2025-12-08T10:15:00Z" + + - stepNumber: 2 + agent: "Solution Architect" + status: "Completed" + output: | + Technical design: + - Use golang.org/x/oauth2 library + - Add OAuthProvider table (Postgres) + - Extend User model with provider_id field + - Create /auth/oauth/{provider} endpoints + startTime: "2025-12-08T10:15:00Z" + completionTime: "2025-12-08T10:35:00Z" + + # ... (steps 3-7) + + finalRFE: | + # RFE: OAuth2 Authentication + + ## Overview + Add OAuth2 authentication supporting Google, GitHub, and Microsoft. + + ## Requirements + - Support 3 OAuth providers + - Fallback to JWT for API clients + - User profile sync on first login + + ## Technical Design + - Use golang.org/x/oauth2 library + - Add OAuthProvider table + - Extend User model + - Create /auth/oauth/{provider} endpoints + + ## Implementation Plan + (Detailed steps from Staff Engineer) + + ## Acceptance Criteria + (Criteria from Product Owner) + + ## Task Breakdown + (Tasks from Team Lead) + + ## Effort Estimation + (Estimates from Team Member) + + ## Risk Assessment + (Risks and mitigation from Delivery Owner) + + startTime: "2025-12-08T10:00:00Z" + completionTime: "2025-12-08T13:45:00Z" +``` + +--- + +## OwnerReferences and Cleanup + +### OwnerReference Pattern + +**Purpose:** Automatic resource cleanup when parent is deleted. 
+ +**Structure:** + +```yaml +apiVersion: vteam.ambient-code/v1alpha1 +kind: AgenticSession +metadata: + name: session-1 + namespace: team-alpha +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: session-1-runner + namespace: team-alpha + ownerReferences: + - apiVersion: vteam.ambient-code/v1alpha1 + kind: AgenticSession + name: session-1 + uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890 + controller: true + # blockOwnerDeletion: false (default, do not set to true) +``` + +**Key Fields:** + +- **`controller: true`**: Only ONE owner can be controller (primary parent) +- **`blockOwnerDeletion`**: **Omit this field** (causes permission issues in multi-tenant) + +**Cleanup Behavior:** + +1. User deletes AgenticSession CR +2. Kubernetes cascades delete to owned resources: + - Job (which cascades to Pod) + - Secret (runner token) + - PVC (workspace, if configured) + +**Reference:** [Backend/Operator Standards - OwnerReferences](../../CLAUDE.md#ownerreferences-pattern) + +--- + +### Cleanup Strategies + +#### Automatic (OwnerReferences) + +**When:** Parent CR deleted + +**How:** Kubernetes garbage collector cascades delete + +**Pros:** +- No manual cleanup required +- Consistent behavior +- Works even if operator is down + +**Cons:** +- Deletion order not controllable +- All child resources deleted (no selective retention) + +--- + +#### Manual (Operator Cleanup) + +**When:** Session completes successfully + +**How:** Operator explicitly deletes Job (Pod cleaned by Job controller) + +**Pattern:** + +```go +func cleanupCompletedSession(namespace, jobName string) { + policy := v1.DeletePropagationBackground + + err := K8sClient.BatchV1().Jobs(namespace).Delete( + context.Background(), jobName, v1.DeleteOptions{ + PropagationPolicy: &policy, + }) + + if err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete job: %v", err) + } +} +``` + +**Pros:** +- Immediate cleanup on completion +- Selective retention (e.g., keep PVC, delete Job) + +**Cons:** +- Requires 
operator to be running +- More complex logic + +--- + +#### Time-Based (TTL) + +**When:** ProjectSettings enables `enableAutoCleanup` + +**How:** Operator periodically deletes old completed CRs + +**Pattern:** + +```go +func cleanupOldSessions(namespace string, retentionDays int) { + cutoff := time.Now().AddDate(0, 0, -retentionDays) + + list, err := DynamicClient.Resource(gvr).Namespace(namespace).List( + context.Background(), v1.ListOptions{}) + if err != nil { + // Without this check, a failed List leaves list nil and the loop below panics + log.Printf("Failed to list sessions in %s: %v", namespace, err) + return + } + + for _, item := range list.Items { + phase, _, _ := unstructured.NestedString(item.Object, "status", "phase") + if phase != "Completed" && phase != "Failed" { + continue // Only cleanup terminal states + } + + completionTime, _, _ := unstructured.NestedString(item.Object, "status", "completionTime") + if completionTime == "" { + continue + } + + t, err := time.Parse(time.RFC3339, completionTime) + if err != nil || t.After(cutoff) { + continue // Too recent or invalid timestamp + } + + // Delete old completed session; only log success if the delete succeeded + if err := DynamicClient.Resource(gvr).Namespace(namespace).Delete( + context.Background(), item.GetName(), v1.DeleteOptions{}); err != nil { + log.Printf("Failed to delete session %s: %v", item.GetName(), err) + continue + } + + log.Printf("Deleted old session %s (completed %s)", item.GetName(), completionTime) + } +} +``` + +**Pros:** +- Automatic space management +- Configurable retention period + +**Cons:** +- Loses audit trail (consider archiving first) +- Requires periodic operator execution + +--- + +## Related Documentation + +- [Core System Architecture](./core-system-architecture.md) - Component overview +- [Agentic Session Lifecycle](./agentic-session-lifecycle.md) - Session state machine +- [Multi-Tenancy Architecture](./multi-tenancy-architecture.md) - Namespace isolation +- [ADR-0001: Kubernetes-Native Architecture](../adr/0001-kubernetes-native-architecture.md) +- [ADR-0003: Multi-Repository Support](../adr/0003-multi-repo-support.md) diff --git a/docs/architecture/multi-tenancy-architecture.md b/docs/architecture/multi-tenancy-architecture.md new file mode 100644 index 000000000..ebdf86415 --- 
/dev/null +++ b/docs/architecture/multi-tenancy-architecture.md @@ -0,0 +1,756 @@ +# Multi-Tenancy Architecture + +## Overview + +The Ambient Code Platform implements **namespace-based multi-tenancy** where each project maps to a dedicated Kubernetes namespace. This ensures complete isolation between tenants while leveraging Kubernetes RBAC for fine-grained access control. + +## Project-to-Namespace Mapping + +```mermaid +graph TB + subgraph "Frontend Layer" + UI[User Interface
Project Selection] + end + + subgraph "Backend API Layer" + API[Backend API
Project Context Validation] + MW[Middleware:
ValidateProjectContext] + end + + subgraph "Kubernetes Cluster" + subgraph "Project: team-alpha" + NSA[Namespace: team-alpha] + RBA[RoleBinding: team-alpha-users] + ASA1[AgenticSession: session-1] + ASA2[AgenticSession: session-2] + PSA[ProjectSettings: settings] + PVC_A[PVC: workspace-session-1] + end + + subgraph "Project: team-beta" + NSB[Namespace: team-beta] + RBB[RoleBinding: team-beta-users] + ASB1[AgenticSession: session-1] + PSB[ProjectSettings: settings] + PVC_B[PVC: workspace-session-1] + end + + subgraph "Project: team-gamma" + NSC[Namespace: team-gamma] + RBC[RoleBinding: team-gamma-users] + ASC1[AgenticSession: session-1] + PSC[ProjectSettings: settings] + end + end + + UI -->|GET /api/projects| API + API -->|List namespaces
user has access to| NSA + API -->|List namespaces
user has access to| NSB + API -->|List namespaces
user has access to| NSC + + UI -->|POST /api/projects/team-alpha/agentic-sessions| MW + MW -->|Validate RBAC| RBA + MW -->|Create CR| ASA1 + + style NSA fill:#e1f5ff + style NSB fill:#ffe1e1 + style NSC fill:#e1ffe1 + style MW fill:#fff4e1 + style RBA fill:#f0e1ff + style RBB fill:#f0e1ff + style RBC fill:#f0e1ff +``` + +**Key Principles:** + +1. **1:1 Mapping:** Each project corresponds to exactly one Kubernetes namespace +2. **Namespace = Isolation Boundary:** Resources cannot cross namespace boundaries +3. **Project Name = Namespace Name:** Simplifies mapping and debugging +4. **RBAC Enforced:** User must have permissions on namespace to access project + +--- + +## User Authentication Flow + +```mermaid +sequenceDiagram + actor User + participant Browser + participant OAuth as OAuth Proxy
(OpenShift OAuth) + participant FE as Frontend + participant BE as Backend API + participant K8s as Kubernetes API + + User->>Browser: Access platform URL + Browser->>OAuth: Request (no token) + + Note over OAuth: User not authenticated + + OAuth->>User: Redirect to OpenShift login + User->>OAuth: Provide credentials + OAuth->>OAuth: Validate credentials
Generate OAuth token + + OAuth->>Browser: Set token in cookie/header + Browser->>FE: Load frontend app
(with token) + + Note over FE: Token stored in memory/cookie + + FE->>BE: API request
Authorization: Bearer {token} + + Note over BE: Extract token from header
X-Forwarded-User from OAuth proxy + + BE->>BE: Validate token format + + BE->>K8s: Create K8s client
with user token + + K8s-->>BE: Client configured + + BE->>K8s: Perform operation
(e.g., List AgenticSessions) + + Note over K8s: Kubernetes validates token
Checks RBAC permissions + + alt User has permissions + K8s-->>BE: Resources returned + BE-->>FE: 200 OK + data + else User lacks permissions + K8s-->>BE: 403 Forbidden + BE-->>FE: 403 Forbidden + end + + FE-->>Browser: Display result + Browser-->>User: Show UI +``` + +**Authentication Components:** + +1. **OAuth Proxy:** Intercepts requests, enforces authentication, injects X-Forwarded-User header +2. **Frontend:** Receives token, includes in all API requests +3. **Backend:** Extracts token, creates K8s client with user credentials +4. **Kubernetes API:** Validates token against ServiceAccount/User, enforces RBAC + +**Reference:** [ADR-0002: User Token Authentication](../adr/0002-user-token-authentication.md) + +--- + +## RBAC Model + +### Role Hierarchy + +```mermaid +graph TB + subgraph "Cluster Roles (Platform Admin)" + CA[ClusterRole:
cluster-admin] + CVR[ClusterRole:
vteam-view-all] + end + + subgraph "Namespace Roles (Project Team)" + NA[Role:
vteam-admin
(CRUD all resources)] + NE[Role:
vteam-editor
(CRUD sessions)] + NV[Role:
vteam-viewer
(Read-only)] + end + + subgraph "Service Accounts" + SAB[ServiceAccount:
backend
(CR writes, token minting)] + SAO[ServiceAccount:
operator
(Watch CRs, manage Jobs)] + SAR[ServiceAccount:
runner
(Update CR status)] + end + + CA -->|Has all permissions| NA + CA -->|Has all permissions| NE + CA -->|Has all permissions| NV + + CVR -->|Can read| NV + + NA -->|Includes| NE + NE -->|Includes| NV + + SAB -->|Bound to| NA + SAO -->|Bound to| NA + SAR -->|Bound to| NV + + style CA fill:#ffe1e1 + style CVR fill:#ffe1e1 + style NA fill:#e1f5ff + style NE fill:#fff4e1 + style NV fill:#e1ffe1 + style SAB fill:#f0e1ff + style SAO fill:#f0e1ff + style SAR fill:#f0e1ff +``` + +### Permission Matrix + +| Resource | vteam-viewer | vteam-editor | vteam-admin | backend SA | operator SA | +|----------|--------------|--------------|-------------|------------|-------------| +| **AgenticSession** | +| list | ✓ | ✓ | ✓ | ✓ | ✓ | +| get | ✓ | ✓ | ✓ | ✓ | ✓ | +| watch | - | - | - | - | ✓ | +| create | - | ✓ | ✓ | ✓ | - | +| update | - | ✓ | ✓ | ✓ | - | +| update/status | - | - | - | ✓ | ✓ | +| delete | - | ✓ | ✓ | ✓ | - | +| **ProjectSettings** | +| list | ✓ | ✓ | ✓ | ✓ | ✓ | +| get | ✓ | ✓ | ✓ | ✓ | ✓ | +| create | - | - | ✓ | ✓ | - | +| update | - | - | ✓ | ✓ | - | +| delete | - | - | ✓ | ✓ | - | +| **RFEWorkflow** | +| list | ✓ | ✓ | ✓ | ✓ | ✓ | +| get | ✓ | ✓ | ✓ | ✓ | ✓ | +| create | - | ✓ | ✓ | ✓ | - | +| update | - | ✓ | ✓ | ✓ | - | +| delete | - | ✓ | ✓ | ✓ | - | +| **Jobs** | +| list | ✓ | ✓ | ✓ | - | ✓ | +| get | ✓ | ✓ | ✓ | - | ✓ | +| create | - | - | - | - | ✓ | +| delete | - | - | ✓ | - | ✓ | +| **Secrets** | +| list | - | - | ✓ | ✓ | ✓ | +| get | - | - | ✓ | ✓ | ✓ | +| create | - | - | - | ✓ | ✓ | +| delete | - | - | ✓ | ✓ | ✓ | + +**Legend:** +- ✓ = Permission granted +- \- = Permission denied + +--- + +## Backend API Authorization Pattern + +### Middleware Chain + +```mermaid +flowchart LR + Req[HTTP Request] --> Recovery[gin.Recovery] + Recovery --> Logger[gin.Logger
Token redaction] + Logger --> CORS[CORS
middleware] + CORS --> Identity[forwardedIdentityMiddleware
Extract X-Forwarded-User] + Identity --> Validate[ValidateProjectContext
RBAC check] + Validate --> Handler[Route Handler
Business logic] + + style Req fill:#e1f5ff + style Validate fill:#fff4e1 + style Handler fill:#e1ffe1 +``` + +### User Token Extraction + +**Backend Pattern** (`components/backend/handlers/helpers.go`): + +```go +// GetK8sClientsForRequest creates K8s clients using user token from request +func GetK8sClientsForRequest(c *gin.Context) (*kubernetes.Clientset, dynamic.Interface) { + // 1. Extract Authorization header + rawAuth := c.GetHeader("Authorization") + if rawAuth == "" { + log.Printf("Missing Authorization header") + return nil, nil + } + + // 2. Parse Bearer token + parts := strings.SplitN(rawAuth, " ", 2) + if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") { + log.Printf("Invalid Authorization header format") + return nil, nil + } + + token := strings.TrimSpace(parts[1]) + if token == "" { + log.Printf("Empty token") + return nil, nil + } + + log.Printf("Creating K8s client with user token (len=%d)", len(token)) + + // 3. Create K8s client with user token + config := &rest.Config{ + Host: os.Getenv("KUBERNETES_SERVICE_HOST"), + BearerToken: token, + TLSClientConfig: rest.TLSClientConfig{ + Insecure: false, + CAFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt", + }, + } + + k8sClient, err := kubernetes.NewForConfig(config) + if err != nil { + log.Printf("Failed to create K8s client: %v", err) + return nil, nil + } + + dynClient, err := dynamic.NewForConfig(config) + if err != nil { + log.Printf("Failed to create dynamic client: %v", err) + return nil, nil + } + + return k8sClient, dynClient +} +``` + +### RBAC Validation Middleware + +**Pattern** (`components/backend/handlers/middleware.go`): + +```go +func ValidateProjectContext() gin.HandlerFunc { + return func(c *gin.Context) { + projectName := c.Param("projectName") + if projectName == "" { + c.JSON(http.StatusBadRequest, gin.H{"error": "Missing project name"}) + c.Abort() + return + } + + // Get user-scoped K8s client + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + 
c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + + // Check if user has access to namespace + ssar := &authv1.SelfSubjectAccessReview{ + Spec: authv1.SelfSubjectAccessReviewSpec{ + ResourceAttributes: &authv1.ResourceAttributes{ + Group: "vteam.ambient-code", + Resource: "agenticsessions", + Verb: "list", + Namespace: projectName, + }, + }, + } + + res, err := reqK8s.AuthorizationV1().SelfSubjectAccessReviews().Create( + context.Background(), ssar, v1.CreateOptions{}) + + if err != nil || !res.Status.Allowed { + c.JSON(http.StatusForbidden, gin.H{ + "error": fmt.Sprintf("No access to project %s", projectName), + }) + c.Abort() + return + } + + // Store project in context for handler + c.Set("project", projectName) + c.Next() + } +} +``` + +### Handler Usage + +**Example** (`components/backend/handlers/sessions.go`): + +```go +func ListSessions(c *gin.Context) { + project := c.GetString("project") // From middleware + + // Get user-scoped K8s clients + _, reqDyn := GetK8sClientsForRequest(c) + if reqDyn == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid token"}) + return + } + + gvr := schema.GroupVersionResource{ + Group: "vteam.ambient-code", + Version: "v1alpha1", + Resource: "agenticsessions", + } + + // List sessions using user token (RBAC enforced by K8s) + list, err := reqDyn.Resource(gvr).Namespace(project).List( + context.Background(), v1.ListOptions{}) + + if err != nil { + log.Printf("Failed to list sessions in project %s: %v", project, err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to list sessions"}) + return + } + + c.JSON(http.StatusOK, gin.H{"items": list.Items}) +} +``` + +**Key Security Patterns:** + +1. **Always use user token** for user-initiated operations +2. **Never fall back** to service account if user token is invalid +3. **Validate RBAC** before resource access +4. **Log securely** - never log token values (use `len(token)`) +5. 
**Return 401** for authentication failures, **403** for authorization failures + +**Reference:** [Backend Development Standards](../../CLAUDE.md#user-scoped-clients-for-api-operations) + +--- + +## Service Account Usage + +### Backend Service Account + +**Purpose:** Limited elevated operations + +**Permissions:** +- Create/update Custom Resources (after user validation) +- Create Secrets for runner token minting +- Read ProjectSettings for configuration + +**Usage Pattern:** + +```go +// ONLY use backend service account for: +// 1. Writing CRs after user token validation +// 2. Minting runner tokens + +func CreateSession(c *gin.Context) { + project := c.GetString("project") + + // Step 1: Validate user has permission using USER TOKEN + // (the dynamic client is not needed here; an unused variable would not compile) + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid token"}) + return + } + + // Validate user can create sessions + if !userCanCreateSessions(reqK8s, project) { + c.JSON(http.StatusForbidden, gin.H{"error": "No permission to create sessions"}) + return + } + + // Step 2: Create CR using BACKEND SERVICE ACCOUNT + // (user token may not have write permissions on status subresource) + obj := buildSessionObject(...) 
+ + created, err := DynamicClient.Resource(gvr).Namespace(project).Create( + context.Background(), obj, v1.CreateOptions{}) + + if err != nil { + log.Printf("Failed to create session: %v", err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to create session"}) + return + } + + // Step 3: Mint token for runner using BACKEND SERVICE ACCOUNT + _, err = mintRunnerToken(project, created.GetName()) // token is not used further in this sketch + if err != nil { + log.Printf("Failed to mint runner token: %v", err) + // Continue - operator can handle missing token + } + + c.JSON(http.StatusCreated, gin.H{ + "name": created.GetName(), + "uid": created.GetUID(), + }) +} +``` + +**Never Use Backend Service Account For:** +- ❌ List/Get operations on behalf of users +- ❌ Delete operations initiated by users +- ❌ Skipping RBAC validation +- ❌ Accessing resources the user doesn't have permission for + +--- + +### Operator Service Account + +**Purpose:** Watch and reconcile Custom Resources + +**Permissions:** +- Watch all Custom Resources (cluster-wide or namespace-scoped) +- Create/delete Jobs +- Create/delete Secrets +- Update CR status subresource + +**Usage Pattern:** + +```go +// Operator uses its service account for ALL operations +func WatchAgenticSessions() { + gvr := types.GetAgenticSessionResource() + + // Watch using operator's service account + watcher, err := config.DynamicClient.Resource(gvr).Watch( + context.Background(), v1.ListOptions{}) + + if err != nil { + log.Printf("Failed to create watcher: %v", err) + return + } + + for event := range watcher.ResultChan() { + // Checked assertion: error events do not carry an *unstructured.Unstructured + if obj, ok := event.Object.(*unstructured.Unstructured); ok { + handleAgenticSession(obj) + } + } +} +``` + +**Note:** Operator has **cluster-wide permissions** to watch and reconcile resources across all namespaces. This is acceptable because: +1. Operator is a trusted infrastructure component +2. Operator only automates declarative state (no user input) +3. 
Operator does not expose a user-facing API + +--- + +### Runner Service Account + +**Purpose:** Update CR status from pod + +**Permissions:** +- Update `/status` subresource for parent AgenticSession +- Read ConfigMaps/Secrets in namespace +- Limited read access to other CRs (for RFE workflows) + +**Token Minting:** + +Backend mints a time-limited token for runner: + +```go +func mintRunnerToken(namespace, sessionName string) (string, error) { + // Create ServiceAccount for runner + sa := &corev1.ServiceAccount{ + ObjectMeta: v1.ObjectMeta{ + Name: fmt.Sprintf("runner-%s", sessionName), + Namespace: namespace, + }, + } + + _, err := K8sClient.CoreV1().ServiceAccounts(namespace).Create( + context.Background(), sa, v1.CreateOptions{}) + + if err != nil && !errors.IsAlreadyExists(err) { + return "", err + } + + // Create token for ServiceAccount + treq := &authv1.TokenRequest{ // authv1 here is k8s.io/api/authentication/v1, not authorization/v1 + Spec: authv1.TokenRequestSpec{ + ExpirationSeconds: int64Ptr(3600), // 1 hour + }, + } + + token, err := K8sClient.CoreV1().ServiceAccounts(namespace).CreateToken( + context.Background(), sa.Name, treq, v1.CreateOptions{}) + + if err != nil { + return "", err + } + + return token.Status.Token, nil +} +``` + +**Usage in Runner:** + +```python +# Runner reads minted token from environment +token = os.environ.get("RUNNER_TOKEN") + +# Use token to update CR status; Kubernetes PATCH needs a patch Content-Type (merge patch here) +requests.patch( + f"{k8s_api}/apis/vteam.ambient-code/v1alpha1/namespaces/{namespace}/agenticsessions/{name}/status", + headers={"Authorization": f"Bearer {token}", "Content-Type": "application/merge-patch+json"}, + json={"status": {"results": results}} +) +``` + +--- + +## Isolation Guarantees + +### Namespace Isolation + +**What's Isolated:** +- ✓ Custom Resources (AgenticSession, ProjectSettings, RFEWorkflow) +- ✓ Jobs and Pods +- ✓ Secrets and ConfigMaps +- ✓ PersistentVolumeClaims +- ✓ NetworkPolicies (if configured) + +**What's Shared:** +- Kubernetes cluster infrastructure (nodes, storage classes) +- CRDs (cluster-scoped) +- ClusterRoles and ClusterRoleBindings +- Platform 
services (backend, operator) + +### RBAC Isolation + +**User A (team-alpha):** +- ✓ Can list/create/delete sessions in `team-alpha` namespace +- ❌ Cannot list sessions in `team-beta` namespace +- ❌ Cannot modify ProjectSettings in `team-gamma` namespace + +**User B (team-beta):** +- ✓ Can list sessions in `team-beta` namespace +- ❌ Cannot access `team-alpha` resources +- ❌ Cannot create sessions in `team-gamma` namespace + +**Enforcement:** +- Backend validates user token + RBAC before operations +- Kubernetes API enforces RBAC on every request +- Operator uses namespace-scoped clients where possible + +### Resource Quotas (Optional) + +**Per-Namespace Limits:** + +```yaml +apiVersion: v1 +kind: ResourceQuota +metadata: + name: project-quota + namespace: team-alpha +spec: + hard: + requests.cpu: "10" + requests.memory: "20Gi" + limits.cpu: "20" + limits.memory: "40Gi" + pods: "50" + persistentvolumeclaims: "10" +``` + +**Prevents:** +- Resource exhaustion by single tenant +- Noisy neighbor problems +- Runaway session costs + +--- + +## Security Boundaries + +```mermaid +graph TB + subgraph "External" + User[User Browser] + Git[Git Repositories] + end + + subgraph "Platform Boundary" + OAuth[OAuth Proxy
Authentication] + end + + subgraph "API Boundary" + BE[Backend API
RBAC Validation] + end + + subgraph "Kubernetes RBAC Boundary" + K8s[Kubernetes API
Token + RBAC enforcement] + end + + subgraph "Namespace: team-alpha" + NSA[Resources for team-alpha] + PodA[Runner Pod A] + end + + subgraph "Namespace: team-beta" + NSB[Resources for team-beta] + PodB[Runner Pod B] + end + + User -->|HTTPS| OAuth + OAuth -->|Token| BE + BE -->|User Token| K8s + + K8s -->|RBAC allows| NSA + K8s -.->|RBAC denies| NSB + + NSA -->|Contains| PodA + NSB -->|Contains| PodB + + PodA -.->|Cannot access| NSB + PodB -.->|Cannot access| NSA + + PodA -->|Can clone| Git + PodB -->|Can clone| Git + + style OAuth fill:#ffe1e1 + style BE fill:#fff4e1 + style K8s fill:#f0e1ff + style NSA fill:#e1f5ff + style NSB fill:#ffe1e1 +``` + +**Security Layers:** + +1. **OAuth Proxy:** Ensures user is authenticated +2. **Backend API:** Validates user token + RBAC permissions +3. **Kubernetes API:** Enforces RBAC on every resource access +4. **Namespace Isolation:** Resources cannot cross boundaries +5. **NetworkPolicies (optional):** Restrict pod-to-pod communication + +--- + +## Project Lifecycle + +### Project Creation + +```mermaid +sequenceDiagram + actor Admin + participant UI as Frontend + participant API as Backend API + participant K8s as Kubernetes + + Admin->>UI: Create new project "team-delta" + UI->>API: POST /api/projects
{"name": "team-delta"} + + API->>K8s: Create Namespace
name: team-delta + + K8s-->>API: Namespace created + + API->>K8s: Create RoleBinding
vteam-admin → admin user + + API->>K8s: Create ProjectSettings CR
(default configuration) + + K8s-->>API: Resources created + + API-->>UI: 201 Created + UI-->>Admin: Project ready +``` + +### Project Deletion + +```mermaid +sequenceDiagram + actor Admin + participant UI as Frontend + participant API as Backend API + participant K8s as Kubernetes + + Admin->>UI: Delete project "team-delta" + UI->>API: DELETE /api/projects/team-delta + + API->>K8s: Delete Namespace
team-delta + + Note over K8s: Cascade delete ALL resources:
- AgenticSessions
- Jobs/Pods
- Secrets
- PVCs
- ProjectSettings + + K8s-->>API: Namespace deleted + + API-->>UI: 204 No Content + UI-->>Admin: Project deleted +``` + +**Cleanup:** +- Kubernetes automatically deletes all resources in the namespace +- No manual cleanup required +- PVCs deleted (data loss - consider backups) + +--- + +## Related Documentation + +- [Core System Architecture](./core-system-architecture.md) - Component overview +- [Agentic Session Lifecycle](./agentic-session-lifecycle.md) - Session execution flow +- [Backend Development Standards](../../CLAUDE.md#backend-and-operator-development-standards) +- [ADR-0001: Kubernetes-Native Architecture](../adr/0001-kubernetes-native-architecture.md) +- [ADR-0002: User Token Authentication](../adr/0002-user-token-authentication.md) +- [Security Standards Context](../../.claude/context/security-standards.md) diff --git a/docs/index.md b/docs/index.md index 6a6a8db58..de9813c97 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,6 +17,8 @@ The platform follows a cloud-native microservices architecture: - Custom Resource Definitions (AgenticSession, ProjectSettings, RFEWorkflow) - Operator-based reconciliation for declarative session management +📐 **[Architecture Diagrams](architecture/index.md)** - Visual guides to system design, component interactions, and data flows + ## Quick Start ### Local Development @@ -64,6 +66,13 @@ For production OpenShift clusters: ## Documentation Structure +### [📐 Architecture](architecture/index.md) +Visual guides and detailed explanations of the platform's design: +- [Core System Architecture](architecture/core-system-architecture.md) - 4-component system overview +- [Agentic Session Lifecycle](architecture/agentic-session-lifecycle.md) - State machine and reconciliation +- [Multi-Tenancy Architecture](architecture/multi-tenancy-architecture.md) - Project isolation and RBAC +- [Kubernetes Resources](architecture/kubernetes-resources.md) - CRD structures and schemas + ### [📘 User Guide](user-guide/index.md) Learn how to use the 
Ambient Code Platform for AI-powered automation: - [Getting Started](user-guide/getting-started.md) - Installation and first session diff --git a/mkdocs.yml b/mkdocs.yml index 0d80bbeae..c64a3116f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -41,6 +41,12 @@ theme: nav: - Home: index.md + - Architecture: + - Overview: architecture/index.md + - Core System Architecture: architecture/core-system-architecture.md + - Agentic Session Lifecycle: architecture/agentic-session-lifecycle.md + - Multi-Tenancy Architecture: architecture/multi-tenancy-architecture.md + - Kubernetes Resources: architecture/kubernetes-resources.md - User Guide: - Overview: user-guide/index.md - Getting Started: user-guide/getting-started.md