Skip to content

Conversation

@cgwalters
Copy link

@cgwalters cgwalters commented Nov 25, 2025

Add architectural decision record documenting how to support user-provided container images with agent code injection. This addresses the requirement to let users run agentic sessions in their own container images with their own tools without forcing them to extend our base image.

@github-actions

This comment has been minimized.

@jeremyeder
Copy link
Collaborator

@cgwalters hey, here is what I was talking about bring-your-own: https://github.com/ambient-code/workflows/tree/main/workflows/template-workflow#directory-structure

All that stuff gets cloned into your session PVC to be made available to claude and agents.

I am using Workflows as the top level thing that teams use to "onboard" to the system.

Workflows don't include a way to customize the runner image, either by having a custom image or customizing the system's base image. So...

@MichaelClifford any concerns with this?

@cgwalters cgwalters force-pushed the adr-0006 branch 2 times, most recently from d5715d0 to 97110ef Compare November 26, 2025 16:58
@cgwalters
Copy link
Author

FTR I'm keeping this draft for now, I'm still refining this and playing with the experimental implementation.

@github-actions
Copy link
Contributor

Claude Code Review

Summary

This PR introduces ADR-0006 which documents a workspace container architecture that enables users to provide custom container images while running Claude Code agent sessions. The implementation adds support for a two-container pattern with the agent running in one container and executing commands in a separate user-provided workspace container via kubectl exec.

Overall Assessment: Strong architectural documentation with a working implementation. However, there are several critical security vulnerabilities and inconsistencies between the ADR documentation and actual implementation that must be addressed before merge.

Issues by Severity

Blocker Issues

  1. Command Injection Vulnerability in kubectl exec (workspace_exec.py:95)
  • The workdir parameter is directly interpolated into a shell command without sanitization
  • Attack scenario: workdir = "/workspace; rm -rf / #" leads to command injection
  • Required fix: Use shlex.quote() for workdir parameter
  • Files: components/runners/claude-code-runner/mcp_servers/workspace_exec.py:95, line 156
  1. Missing RBAC Resource Cleanup (sessions.go:1582-1646)
  • The ensureRunnerServiceAccount() function creates RBAC resources but these are never cleaned up
  • No OwnerReferences set on created RBAC resources
  • Resources persist after namespace/session deletion
  • Required fix: Add OwnerReferences or implement explicit cleanup
  1. ADR Documentation Contradicts Implementation
  • The ADR mentions Pattern A and Pattern B but implementation only implements kubectl exec
  • PR description says Pattern A is primary but code comments reference Pattern B
  • Required fix: Remove Pattern A/B references or clarify what they mean

Critical Issues

  1. Service Account Token Auto-Mount Security Risk (sessions.go:447)
  • When workspaceMode is enabled, SA token is mounted in both containers
  • User workspace container should NOT have Kubernetes API access
  • Required fix: Mount token selectively only in agent container
  1. Missing Input Validation for workspaceImage (sessions.go:344)
  • The workspaceImage field is used directly without validation
  • Users could specify malicious/backdoored images
  • Recommended fix: Add validation for image registry/format
  1. ShareProcessNamespace Enabled Without Documentation (sessions.go:449)
  • Security implications not documented
  • Only required for nsenter mode which ADR rejects
  • Required action: Remove ShareProcessNamespace or document why it's needed
  1. No Testing for Workspace Container Mode
  • Test files only test MCP server in isolation
  • Missing integration tests for actual AgenticSession with workspaceImage
  • Missing E2E tests for kubectl exec end-to-end

Major Issues

  1. Error Handling Violations - Silent Failures (sessions.go:351-353)
  • Session continues even though RBAC setup failed
  • Fix: Return error if RBAC setup fails
  1. Idempotency Issues in ensureRunnerServiceAccount (sessions.go:1592, 1616, 1640)
  • Doesn't handle resource updates
  • Could leave stale RBAC in place
  1. Missing CRD Field: workspacePodTemplate
  • ADR extensively documents workspacePodTemplate but it's not in the CRD
  • Users cannot customize resources, env vars, volume mounts as documented
  1. Inconsistent Logging Levels (workspace_exec.py)
  • User commands logged at INFO level (potential sensitive data)
  • Recommendation: Use structured logging with sensitivity awareness

Minor Issues

  1. Hardcoded Timeout Values - Make configurable via environment variable
  2. Missing Dependency: aiohttp not in requirements file
  3. Dockerfile kubectl Installation Not Pinned - Pin to specific version
  4. Test File Incomplete (integration_test_pattern_b.py:188)
  5. Code Duplication in exec Functions
  6. Magic Numbers Without Constants

Positive Highlights

  • Excellent ADR Documentation
  • Clean Separation of Concerns
  • MCP Integration is clever
  • RBAC Least Privilege correctly scoped
  • Good Test Coverage Intent
  • Backward Compatibility maintained
  • Generally good logging

Recommendations

Immediate Actions (Required for Merge)

  1. Fix command injection vulnerability - Use shlex.quote() for workdir
  2. Implement RBAC cleanup - Add OwnerReferences or explicit cleanup
  3. Resolve ADR/implementation mismatch - Clarify Pattern A/B
  4. Fix SA token security - Mount token only in agent container
  5. Add workspaceImage validation
  6. Document/remove ShareProcessNamespace
  7. Add integration tests
  8. Fix error handling - Return errors instead of warnings
  9. Add missing CRD fields or remove from ADR

Summary: Strong architectural work with good intent, but must address critical security issues before merge. The command injection and RBAC cleanup issues are blockers.

@codecov
Copy link

codecov bot commented Dec 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@cgwalters
Copy link
Author

Should we try to separate a pull request adding an ADR from one adding an implementation? (Kind of like RFC vs code?)

@cgwalters
Copy link
Author

OK I completely reworked this as I decided it's just way simpler to skip things like custom MCP servers and basically give the agent sugar on top of kubectl as a way to create pods for its work.

I also dropped out the PoC implementation work I had; feels cleaner to keep that as a separate PR. Going to have the agent try updating the PoC to match this latest implementation proposal to see how it works/feels.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2025

Claude Code Review

Summary

This ADR proposes a well-thought-out architecture for injecting agent orchestration capabilities into user-provided container images. The document is comprehensive, detailed, and demonstrates strong technical reasoning. The proposed solution (agent-orchestrated pod spawning via ambient-agentctl) is sound and aligns well with Kubernetes patterns and the platform's existing architecture.

However, there are several security, documentation, and architectural concerns that should be addressed before merging, particularly around session isolation, RBAC scope clarity, and alignment with the ADR template format.


Issues by Severity

🔴 Critical Issues

1. Session Isolation Security Model is Underspecified

Location: Lines 412-471 (Session Isolation section)

Issue: The ADR acknowledges that the current "soft isolation" model allows sessions within the same project namespace to interfere with each other (line 433-438), but treats this as acceptable without sufficient justification.

Security implications:

  • Session A can kubectl exec into Session B's workspace pod
  • Session A can delete Session B's pods
  • Session A can read Session B's secrets (if mounted)
  • This violates the principle of least privilege and creates a significant attack surface

The ADR states (line 471): "this is not a security boundary - an agent can bypass ambient-agentctl using raw kubectl commands"

Problem: If this isn't a security boundary, what is? The platform allows arbitrary code execution (Claude Code agents), so isolation between sessions is critical. A malicious or compromised agent in one session could disrupt or exfiltrate data from other sessions in the same project.

Recommendation:

  • Either: Implement namespace-per-session isolation now (marked as "Future Work" on line 473), OR
  • Clearly document this as a known security limitation in the ADR with:
    • Explicit threat model: Who can exploit this? (malicious user, compromised agent, buggy agent)
    • Mitigation: Projects should only run trusted sessions, or use separate projects for sensitive workloads
    • Roadmap: When will namespace-per-session be implemented?
    • Security posture: This architecture is suitable for development/testing, not multi-tenant production

2. RBAC Permissions Scope Ambiguity

Location: Lines 320-376 (Agent Container Requirements)

Issue: The RBAC section (line 322-368) shows namespace-scoped permissions but doesn't clearly state which namespace the ServiceAccount operates in.

Confusion points:

  • Line 139: Agent pod runs in my-project namespace
  • Line 372: "Agent can only manage pods in its own session namespace"
  • Line 381: "Agent can only create/exec pods in its own session namespace"

Question: Is "session namespace" the same as "project namespace"? The ADR uses both terms inconsistently.

From Session Isolation section (line 430): "All AgenticSessions within a project currently run in the same Kubernetes namespace"

This confirms they're the same, but the term "session namespace" is misleading.

Recommendation:

  • Use consistent terminology: "project namespace" (since that's what it actually is per existing architecture)
  • Update line 372 and 381 to say: "Agent can only manage pods in its project namespace (where multiple sessions may coexist)"
  • Add explicit security note: "In the current architecture, sessions share a project namespace. This means the ServiceAccount can access ALL pods in the project, not just its own session's pods. See Session Isolation section."

3. No Validation Section

Location: Missing (should follow Implementation Notes per template)

Issue: ADR-0001 and ADR-0002 both include "Validation" sections with success metrics and lessons learned. This ADR has no validation section, even though it's marked as Status: Proposed (line 3).

Problem:

  • How will success be measured?
  • What tests need to pass before this is "Accepted"?
  • What are the acceptance criteria?

Recommendation:
Add a Validation section before merging with acceptance criteria, success metrics, and open questions.


🟡 Major Issues

4. Missing Deciders and Technical Story Metadata

Location: Lines 3-5

Issue: The ADR metadata is incomplete compared to the template (template.md:2-6). Missing Deciders: field and Technical Story: field.

Other ADRs include this:

  • ADR-0001: "Deciders: Platform Architecture Team"
  • ADR-0002: "Deciders: Security Team, Platform Team" + "Technical Story: Security audit revealed RBAC bypass"

Recommendation: Add missing metadata fields per ADR template.

5. Security Context Recommendations are Weak

Location: Lines 385-410 (Security Considerations - Pod Escape Mitigation)

Issue: Security settings are marked as "SHOULD" (line 386) rather than "MUST", and the example is buried in a reference (line 398).

Current wording (line 386-391):

All agent-spawned workspace pods SHOULD use SecurityContext
Recommended settings: [...]

Problem:

  • "SHOULD" and "Recommended" are too weak for a platform that executes arbitrary user code
  • The ADR mentions PodSecurityStandards enforcement (line 391) but doesn't specify which level (restricted/baseline)
  • No guidance on who enforces this (operator? ambient-agentctl? platform admin?)

Recommendation: Make security contexts MANDATORY by default, enforced by ambient-agentctl, with explicit opt-out for advanced use cases. Platform admins MUST enforce PodSecurityStandards at the restricted level.

6. ConfigMap Mounting Details Missing

Location: Lines 64-88, 166-171 (ConfigMap creation and mounting)

Issue: The ADR shows ConfigMap creation but doesn't address:

  1. Size limits: ConfigMaps have a 1MB limit. What if the merged template exceeds this?
  2. Update propagation: How long does it take for ConfigMap updates to reflect in the agent pod?
  3. Immutability: Should the ConfigMap be immutable once created?

Recommendation: Add ConfigMap lifecycle clarification (immutability, size limits, update behavior).

7. Error Handling and Failure Modes Not Documented

Location: Throughout implementation sections

Issue: The ADR doesn't address failure scenarios:

  1. What if workspace pod fails to create? (image pull error, resource quota exceeded)
  2. What if workspace pod crashes? (OOM kill, segfault in user code)
  3. What if PVC is full? (/workspace out of disk space)
  4. What if kubectl/ambient-agentctl fails? (network error, RBAC denial)

Recommendation: Add "Error Handling" subsection documenting failure modes and graceful degradation.


🔵 Minor Issues

8. Inconsistent YAML Formatting

Location: Lines 65-88, 132-184, 187-215, etc.

Issue: Some YAML blocks use comments inconsistently, and indentation varies. Recommend consistent comment style.

9. Alternatives Section Could Be More Concise

Location: Lines 668-931 (Alternatives Considered)

Issue: The "Purely Ephemeral Pods" alternative (lines 762-931) is 169 lines of detailed Python code. Consider collapsing detailed implementation into expandable section.

10. yq Dependency Not Mentioned in Requirements

Location: Lines 98-106 (Agent uses template)

Issue: The example shows using yq (line 99), but yq is not listed in "Agent Container Requirements" (line 314). Clarify if this is pseudocode or actual dependency.

11. Potential Name Collision for Default Workspace Pod

Location: Line 252 (Default workspace pod naming)

Issue: Default workspace pod is ${SESSION_NAME}-ws. If a user manually creates a pod with this name, ambient-agentctl exec could exec into the wrong pod.

Recommendation: ambient-agentctl should verify pod ownership via OwnerReferences before exec.

12. No Discussion of Resource Limits on Workspace Pods

Location: User Configuration Examples (lines 617-666)

Issue: Examples show resource limits but don't discuss default limits or why they're important. Add resource management guidance with recommended defaults.


Positive Highlights

  1. Excellent Alternatives Analysis: Thoroughly evaluates rejected approaches with clear rationale. The "Purely Ephemeral Pods" section is particularly detailed and educational.

  2. Comprehensive Configuration System: The cascading configuration (Platform → ProjectSettings → AgenticSession) is well-designed and clearly documented (lines 41-127).

  3. Security Awareness: The ADR acknowledges security concerns (lines 378-410) and recommends SecurityContext settings. The namespace-per-session future work shows awareness of isolation limitations.

  4. Practical Examples: Usage examples (lines 932-987) demonstrate real-world workflows and help readers understand the UX.

  5. Implementation Guidance: The "Files to Create/Modify" section (lines 988-1091) provides clear implementation roadmap for developers.

  6. Clear Decision: The decision is well-articulated with sound reasoning. The ambient-agentctl approach is pragmatic and aligns with Kubernetes patterns.

  7. Consistent with Existing Architecture: The agent-spawned pod pattern aligns with ADR-0001 (Kubernetes-native) and maintains the separation of concerns established in the platform.


Recommendations

Before Merging (Blockers)

  1. Address session isolation security (Critical Issue Outcome: Reduce Refinement Time with agent System #1): Either implement namespace-per-session now, or clearly document security limitations and acceptable use cases
  2. Clarify RBAC namespace scope (Critical Issue Epic: RAT Architecture & Design #2): Fix terminology confusion between "session namespace" and "project namespace"
  3. Add Validation section (Critical Issue Epic: Data Source Integration #3): Define acceptance criteria and success metrics
  4. Add Deciders/Technical Story metadata (Major Issue Epic: AI Agent Development #4): Follow ADR template format

Strongly Recommended (Pre-Acceptance)

  1. Strengthen security defaults (Major Issue Epic: Jira Integration & Workflow #5): Make SecurityContext mandatory, not optional
  2. Document ConfigMap lifecycle (Major Issue Epic: Testing & Validation #6): Clarify immutability and update behavior
  3. Add error handling section (Major Issue Test: Automation Workflow Validation #7): Document failure modes and graceful degradation

Nice-to-Have (Post-Acceptance)

  1. Fix YAML formatting consistency (Minor Issue Test: Updated Workflow Validation #8)
  2. Collapse detailed alternative implementations (Minor Issue Bump actions/add-to-project from 0.5.0 to 1.0.2 #9)
  3. Clarify yq dependency (Minor Issue Bump actions/checkout from 4 to 5 #10)
  4. Add pod ownership verification (Minor Issue Bump actions/github-script from 6 to 7 #11)
  5. Add resource limits guidance (Minor Issue Add vTeam shared Claude Code configuration with hooks-based enforcement #12)

Final Assessment

This is a high-quality ADR with strong technical reasoning and comprehensive coverage. The proposed architecture is sound and well-aligned with Kubernetes patterns.

However, the session isolation security model needs to be resolved before this can be accepted. The current "soft isolation" approach creates a significant security gap that must either be:

  1. Fixed (implement namespace-per-session), OR
  2. Acknowledged (document as known limitation with clear guidance on acceptable use)

Once the critical security concerns are addressed and the Validation section is added, this ADR will be ready for acceptance.

Recommendation: Request Changes (address Critical Issues #1-3 and Major Issue #4, then re-review)

@cgwalters
Copy link
Author

I'm continuing to iterate on this and experiment with a PoC implementation, I'll post an update soon

Increase flexibility by allowing users to configure
the container used by the agent's tools (separate from the container
running the agent itself).

Assisted-by: Claude Code (Opus 4.5)
@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Claude Code Review

Summary

This PR implements ADR-0006: Workspace Container Architecture, introducing a significant architectural enhancement that separates agent execution from the workspace environment. The implementation enables users to provide custom container images while maintaining security through a session-proxy sidecar pattern.

Overall Assessment: Strong implementation with excellent security considerations. The code follows most established patterns, but there are several areas requiring attention before merge.

Issues by Severity

🚫 Blocker Issues

1. Backend Handler Missing User Token Authentication (workspace_container.go)

Location: components/backend/handlers/workspace_container.go:39-43, 112-115

Issue: Both GetWorkspaceContainerSettings and UpdateWorkspaceContainerSettings use user-scoped clients correctly, but they don't check if the client creation succeeded before using it.

// CURRENT CODE (lines 39-43)
_, reqDyn := GetK8sClientsForRequest(c)
if reqDyn == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing authentication token"})
    return
}

Why it's a blocker: According to security standards (security-standards.md:13-18), we must check both clients. The current code only checks reqDyn but discards reqK8s with _. If GetK8sClientsForRequest returns (nil, nil), this will pass the check but fail later.

Fix: Check both return values or assign reqK8s to validate token authentication:

reqK8s, reqDyn := GetK8sClientsForRequest(c)
if reqK8s == nil || reqDyn == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing authentication token"})
    return
}

Reference: .claude/context/backend-development.md:18-25, CLAUDE.md:341-345


2. Unsafe Type Assertions in Backend Handler (workspace_container.go:62-91)

Location: components/backend/handlers/workspace_container.go:62, 72-90

Issue: Direct type assertions without checking, violating Critical Rule #4:

// Line 62 - UNSAFE
spec, _, _ := unstructured.NestedMap(obj.Object, "spec")

// Line 72-90 - Direct type assertion without checking 'ok'
if image, ok := wcMap["image"].(string); ok {
    settings.Image = image
}

Why it's a blocker: CLAUDE.md:358-362 explicitly forbids this pattern. If spec is nil, line 63 will panic when accessing wcMap.

Fix: Check the found return value:

spec, found, err := unstructured.NestedMap(obj.Object, "spec")
if !found || err != nil {
    c.JSON(http.StatusOK, WorkspaceContainerSettings{})
    return
}

Reference: CLAUDE.md:358-362, .claude/patterns/error-handling.md:62-85


🔴 Critical Issues

3. Session Proxy Missing Rate Limiting

Location: components/runners/session-proxy/pkg/proxy/handler.go:88-151

Issue: The /exec endpoint has no rate limiting or concurrent request limits. A malicious agent could spawn unlimited parallel commands, exhausting cluster resources.

Recommendation:

  • Add rate limiting middleware (e.g., 10 concurrent execs per session)
  • Implement request timeout (currently only has ReadTimeout: 10s for receiving request, but no overall timeout)
  • Add metrics for monitoring

Security Impact: Medium - Could lead to resource exhaustion attacks.


4. Frontend Component Violates Type Standards

Location: components/frontend/src/components/workspace-sections/settings-section.tsx:52

Issue: Using as const assertion with array instead of proper type definition:

const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY", ...] as const, []);

Why it's critical: Frontend standards require type over interface (DESIGN_GUIDELINES.md), and using useMemo for constants is unnecessary overhead.

Fix: Define as a module-level constant type:

type FixedKey = 
  | "ANTHROPIC_API_KEY" 
  | "GIT_USER_NAME" 
  | "GIT_USER_EMAIL"
  | "GITHUB_TOKEN"
  | "JIRA_URL"
  | "JIRA_PROJECT"
  | "JIRA_EMAIL"
  | "JIRA_API_TOKEN"
  | "GITLAB_TOKEN"
  | "GITLAB_INSTANCE_URL";

const FIXED_KEYS: readonly FixedKey[] = [
  "ANTHROPIC_API_KEY",
  // ... rest
];

Reference: components/frontend/DESIGN_GUIDELINES.md:122-140


5. Operator Missing Error Context in Status Updates

Location: components/operator/internal/handlers/sessions.go:288-289

Issue: When Vertex secret copy fails, the error message doesn't include the actual error:

if err := copySecretToNamespace(copyCtx, ambientVertexSecret, sessionNamespace, currentObj); err != nil {
    return fmt.Errorf("failed to copy %s secret from %s to %s (CLAUDE_CODE_USE_VERTEX=1): %w", 
        types.AmbientVertexSecretName, operatorNamespace, sessionNamespace, err)
}

Good: Using %w for error wrapping.

But: The session status should be updated before returning the error (Pattern 2 in error-handling.md:146-160).

Fix: Update session status to reflect error state:

if err := copySecretToNamespace(...); err != nil {
    updateAgenticSessionStatus(sessionNamespace, name, map[string]interface{}{
        "phase":   "Error",
        "message": fmt.Sprintf("Failed to copy Vertex secret: %v", err),
    })
    return fmt.Errorf("failed to copy %s secret: %w", types.AmbientVertexSecretName, err)
}

Reference: .claude/patterns/error-handling.md:146-170


🟡 Major Issues

6. Operator Session Handler Extremely Long (sessions.go)

Location: components/operator/internal/handlers/sessions.go

Issue: The handleAgenticSessionEvent function is likely 500+ lines based on the diff. This violates the principle of keeping functions focused and testable.

Impact: Difficult to review, test, and maintain. Makes it hard to isolate bugs.

Recommendation: Refactor into smaller functions:

  • handleSessionStopped()
  • ensureSessionWorkspace()
  • copyPlatformSecrets()
  • createRunnerJob()
  • createWorkspacePod()

Reference: Go best practices recommend functions under 50-100 lines.


7. Frontend Component Exceeds Line Limit

Location: components/frontend/src/components/workspace-sections/settings-section.tsx (695 lines)

Issue: Component is 695 lines, far exceeding the 200-line guideline.

Fix: Extract sub-components:

  • AnthropicSection.tsx
  • GitHubIntegrationSection.tsx
  • JiraIntegrationSection.tsx
  • GitLabIntegrationSection.tsx
  • WorkspaceContainerSection.tsx
  • CustomEnvVarsSection.tsx

Reference: components/frontend/DESIGN_GUIDELINES.md:168-177


8. Missing Security Context Validation in Session Proxy

Location: components/runners/session-proxy/pkg/proxy/handler.go:179-215

Issue: The proxy executes commands in the workspace pod but doesn't verify the pod's SecurityContext before executing. A misconfigured workspace pod could run as root.

Recommendation: Add validation in findWorkspacePod to ensure the pod has appropriate SecurityContext:

// Check pod security context
if pod.Spec.SecurityContext == nil || 
   pod.Spec.SecurityContext.RunAsNonRoot == nil || 
   !*pod.Spec.SecurityContext.RunAsNonRoot {
    return "", fmt.Errorf("workspace pod %s does not have required SecurityContext", pod.Name)
}

Security Impact: Medium - Defense in depth against privilege escalation.


🔵 Minor Issues

9. Inconsistent Comment Style in Session Proxy

Location: components/runners/session-proxy/pkg/proxy/handler.go:1-11

Issue: Package comment uses single-line style, but multi-line would be more appropriate for the detailed description.

Fix: Use standard Go doc comment format:

// Package proxy implements the streaming exec API for workspace containers.
//
// The proxy runs as a sidecar in the runner pod and handles kubectl exec
// operations to workspace pods, providing streaming command execution over HTTP.
package proxy

10. Magic Numbers in Workspace MCP Server

Location: components/runners/claude-code-runner/mcp-servers/workspace.py:53-55

Issue: Hardcoded timeout constants without explanation:

DEFAULT_TIMEOUT = 300
MAX_TIMEOUT = 1800  # 30 minutes max

Recommendation: Make configurable via environment variables with fallbacks:

DEFAULT_TIMEOUT = int(os.environ.get("WORKSPACE_DEFAULT_TIMEOUT", "300"))
MAX_TIMEOUT = int(os.environ.get("WORKSPACE_MAX_TIMEOUT", "1800"))

11. Frontend React Query Hook Missing Error Handling

Location: components/frontend/src/services/queries/use-workspace-container.ts:16-28

Issue: The mutation doesn't have onError handler. While the component handles errors, it's better practice to have centralized error logging.

Recommendation: Add error logging:

onError: (error) => {
  console.error('Failed to update workspace container settings:', error);
},

Positive Highlights

Excellent ADR Documentation: ADR-0006 is comprehensive, well-structured, and provides clear rationale for architectural decisions. The "Alternatives Considered" section is particularly valuable.

Security-First Design: The session-proxy sidecar pattern with privilege separation (token isolation, localhost-only API) demonstrates strong security awareness.

Proper Owner References: Operator correctly sets OwnerReferences on workspace pods and PVCs for automatic cleanup (sessions.go:232-242).

Frontend Follows React Query Patterns: New hooks (useWorkspaceContainerSettings, useUpdateWorkspaceContainerSettings) correctly use React Query with proper cache invalidation.

Type-Safe Unstructured Access in Most Places: Operator code generally uses unstructured.NestedMap correctly with three-value returns (e.g., sessions.go:341-344).

Streaming Implementation: Session proxy's streaming exec with chunked transfer encoding provides excellent real-time feedback (handler.go:125-150).

Comprehensive CRD Schema: ProjectSettings CRD extension includes proper validation and optional fields for workspace container customization (projectsettings-crd.yaml).

Backward Compatibility: Implementation maintains compatibility with existing sessions while enabling new workspace container mode.

Recommendations

High Priority (Before Merge)

  1. Fix blocker issues Outcome: Reduce Refinement Time with agent System #1 and Epic: RAT Architecture & Design #2 - Backend handler authentication and type safety
  2. Add rate limiting to session proxy - Prevent resource exhaustion
  3. Refactor operator session handler - Break into smaller functions (target: <100 lines each)
  4. Add integration tests - Test workspace pod creation, proxy exec, and error scenarios

Medium Priority (Can Address in Follow-up)

  1. Refactor frontend settings component - Extract sub-components
  2. Add metrics and monitoring - Track exec requests, failures, timeouts
  3. Document operator RBAC requirements - Explain pods/exec permission in ADR-0006
  4. Add validation for resource limits - Ensure CPU/memory values are valid K8s quantities

Nice to Have

  1. Add devcontainer auto-detection - As outlined in ADR-0006 Future Directions
  2. Consider namespace-per-session - For stricter isolation (also in Future Directions)
  3. Add Prometheus metrics - For observability of workspace container usage

Final Verdict

Overall: 🟡 Needs Work Before Merge

This is a well-architected feature with strong security foundations, but the blocker issues must be addressed to comply with established backend development standards. The implementation demonstrates good understanding of K8s patterns and streaming APIs.

Estimated effort to address blockers: 2-3 hours

Recommendation: Request changes for blockers #1-2, strongly recommend addressing critical issues #3-5 before merge.


Review conducted by Claude Code following standards in CLAUDE.md, .claude/context/, and .claude/patterns/
Reviewed files: 40 files changed (+3017, -26)
Focus areas: Security, Backend patterns, Frontend standards, Operator reconciliation


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@@ -0,0 +1,7 @@
apiVersion: v1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, some minikube-related stuff leaked into this PR too beacuse I'd been developing on top of #367 first. Will fix.

@@ -0,0 +1,149 @@
# Claude Agent SDK Patches

This directory contains patches applied to the `claude-agent-sdk` package to ensure compliance with ADR-0006 (Agent Injection Architecture).
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was good for PoC but now we need to actually try upstreaming these changes without patching like this.

@@ -0,0 +1,233 @@
#!/bin/bash
# Minikube deployment script for Ambient Code Platform
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove this

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Claude Code Review - PR #364

Summary

ADR-0006: Workspace Container Architecture implements separation of agent execution from tool execution environment (~3,000 lines across backend, frontend, operator, session-proxy).

Assessment: Well-architected with strong security isolation, BUT has critical security gaps requiring fixes before merge.

Blocker Issues (Must Fix)

  1. Missing User Token Auth (workspace_container.go:179,188) - Uses user-scoped client for CR writes instead of backend SA. Violates Critical Rule Outcome: Reduce Refinement Time with agent System #1.

  2. No RBAC Check (workspace_container.go:97-201) - Missing SelfSubjectAccessReview before ProjectSettings update.

  3. Unvalidated Container Images (line 147-149) - Supply chain risk. Need registry allowlist and digest verification.

  4. No Resource Limit Validation (lines 150-167) - Could crash operator. Must use resource.ParseQuantity.

  5. Unchecked Type Assertions (lines 62,77) - Ignoring error returns from NestedMap. Violates Critical Rule Epic: AI Agent Development #4.

  6. No Rate Limiting (session-proxy/handler.go) - /exec endpoint vulnerable to DoS. Add rate limiter.

  7. Missing Query Safeguards (use-workspace-container.ts) - Lacks enabled guards and staleTime.

  8. Missing OwnerReferences - Must verify workspace pod has OwnerReference to AgenticSession.

Positive Highlights

✅ Excellent two-pod architecture with localhost proxy
✅ Strong privilege separation (runner has no token)
✅ Comprehensive ADR with security analysis
✅ Real-time streaming implementation
✅ Proper use of Shadcn UI and React Query

Recommendation

REQUEST CHANGES - Do not merge until blockers #1-8 resolved. Architecture is sound but implementation needs hardening.

See full review with code examples and patterns: [detailed analysis above]


Review by Claude Code (Sonnet 4.5) using memory system context


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Implement workspace container architecture per ADR-0006:

- Add session-proxy sidecar for streaming command execution
- Add workspace MCP server for agent command interface
- Add workspace container settings to ProjectSettings CRD
- Add backend handlers for workspace container configuration
- Add frontend UI for workspace container settings
- Update operator to create workspace pods with user-configurable images

Security improvements from review:
- Add rate limiting to session proxy (max concurrent execs)
- Add exec timeout (configurable via environment)
- Fix authentication checks in backend handlers
- Add proper error handling for unstructured K8s access
- Update session status on Vertex secret copy failure

Assisted-by: Claude Code (Opus 4.5)
@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Claude Code Review

Summary

This PR implements ADR-0006, introducing workspace container architecture with agent isolation. The implementation spans backend API endpoints, frontend UI, operator logic, and a new session-proxy sidecar. The code quality is generally strong with good adherence to established patterns, but there are several critical security and architectural concerns that should be addressed.

Overall Assessment: The PR demonstrates solid architectural thinking and mostly follows the project's established patterns. However, there are blocker-level security issues around RBAC validation and service account usage that must be fixed before merge.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Blocker Issues (Must Fix Before Merge)

1. Missing RBAC Validation in Backend Handlers

Location: components/backend/handlers/workspace_container.go:116-120, 39-43

Both GetWorkspaceContainerSettings and UpdateWorkspaceContainerSettings use user-scoped clients but do not perform RBAC checks before operations. Without RBAC checks, any authenticated user can read/modify workspace settings for projects they don't have access to.

Required: Add SelfSubjectAccessReview checks before Get/Update operations per backend-development.md patterns.

2. User-Scoped Client Used for CR Write Operations

Location: components/backend/handlers/workspace_container.go:182-198

The handler uses user-scoped client (reqDyn) to create/update ProjectSettings CRs. Per k8s-client-usage.md Pattern 2, CR writes should use the backend service account after RBAC validation.

Required: Switch to DynamicClient (backend SA) for Create/Update operations after validating user permissions.

3. Missing OwnerReferences on Workspace Pod

Location: Operator session handler

Per CLAUDE.md Critical Rule #5, all child resources must have OwnerReferences set for automatic cleanup. Without OwnerReferences, workspace pods won't be cleaned up when the AgenticSession is deleted, causing resource leaks.

Required: Verify OwnerReferences are set on workspace pods in operator.

@cgwalters
Copy link
Author

OK, there's now conflicts with #367 on git main (which I was using to develop this). I may try testing it this out instead on a full OCP (or maybe https://github.com/minc-org/minc instead of minikube), but anyways this is probably far enough along that folks can comment on the overall architecture idea.

Does anyone think it might make sense to just always have a complete "sub-namespace" per workspace where the agent can run arbitrary pods by default say? Or maybe we go all the way to requiring https://github.com/projectcapsule/capsule (feels a bit like overkill by default, but we should certainly make it easy to integrate with that, as well as being easily able to provision completely distinct cloud resources that are lifecycle-bound; lots of precedent in that space).

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Critical Issues (Should Fix Before Merge)

4. Type Assertions Without Safety Checks

Location: workspace_container.go:77, 84-95

Direct type assertions like wcMap["image"].(string) violate CLAUDE.md Critical Rule #4. If the type is wrong, this will panic and crash the entire backend process.

Fix: Use unstructured.NestedString() helpers instead of direct type assertions.

5. Frontend: any Type Usage

Location: components/frontend/src/app/api/projects/[name]/workspace-container/route.ts:10, 16, 47

Uses .catch(() => ({})) which implicitly types the result as any. This violates frontend-development.md Critical Rule #1: "Zero any Types".

Fix: Use explicit type annotation and try/catch blocks.

6. Session Proxy: Potential Command Injection

Location: Session proxy pkg/proxy/handler.go

If the session proxy constructs shell commands using string concatenation with user input, it's vulnerable to command injection.

Verify: Commands are passed as arrays/slices, not concatenated strings. User-provided paths are validated/sanitized.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Major Issues (Address in Follow-Up)

7. Frontend: Manual State Synchronization

Location: settings-section.tsx:96-104

Manual useEffect for syncing API data to local state is an anti-pattern. React Query should be the single source of truth. This increases complexity and is prone to stale state bugs.

8. Hardcoded String Literals

Location: workspace_container.go:46

The singleton ProjectSettings CR name "projectsettings" is hardcoded in multiple locations. Should be a constant.

9. Missing Loading State for Workspace Settings

Location: settings-section.tsx:542-646

The workspace container section doesn't show a loading state while fetching settings. Per DESIGN_GUIDELINES.md, all data operations should handle loading states.

Minor Issues

  • Incomplete CRD validation for resource quantities
  • Unused Switch component added but not used
  • Missing context in operator logging
  • Could standardize error message constants

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Positive Highlights

  1. Excellent Architecture Documentation: ADR-0006 is thorough, well-structured, and clearly explains the security model. The threat mitigation table and mermaid diagrams are particularly helpful.

  2. Type-Safe Frontend Types: The WorkspaceContainerSettings type definition is clean and properly documented with comments explaining optional fields.

  3. Consistent API Patterns: The new endpoints follow the established /api/projects/:projectName/resource pattern and use the same middleware chain.

  4. Shadcn UI Usage: Frontend correctly uses Shadcn components (Input, Label, Button) rather than custom UI.

  5. React Query Integration: Frontend properly uses query and mutation hooks with cache invalidation, following the established patterns.

  6. Error Handling Structure: Backend handlers follow the established error handling patterns (IsNotFound checks, logging with context, generic user messages).

  7. CRD Schema: The ProjectSettings CRD extension is well-structured with clear descriptions and proper nesting.

  8. Security Consciousness: The ADR demonstrates strong security thinking (privilege separation, token isolation, no Bash tool).

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

Recommendations

Must Fix Before Merge (Blockers)

  1. Add RBAC validation to both workspace_container.go handlers (SelfSubjectAccessReview checks)
  2. Switch to backend service account (DynamicClient) for CR writes in UpdateWorkspaceContainerSettings
  3. Verify OwnerReferences are set on workspace pods in operator

Should Fix Before Merge (Critical)

  1. Replace type assertions with unstructured.Nested* helpers in workspace_container.go
  2. Fix any types in frontend route.ts error handling
  3. Review session proxy for command injection vulnerabilities

Testing Recommendations

  1. RBAC Testing: Add integration test verifying users without projectsettings permissions get 403
  2. Type Safety: Add unit test with malformed ProjectSettings CR to ensure no panics
  3. E2E Test: Add Cypress test for workspace container settings flow
  4. Security Test: Verify session proxy command injection is not possible

Conclusion

This is a significant architectural enhancement with strong design thinking. The ADR is excellent and the implementation follows most project patterns. However, the critical security issues around RBAC validation and service account usage must be fixed before merge to maintain the security guarantees established in the codebase.

Recommendation: Request changes for blockers #1, #2, and verification of #3. The rest can be addressed in follow-up PRs if time-sensitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants