-
Notifications
You must be signed in to change notification settings - Fork 33
[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446
Conversation
Long-running sessions were failing due to ServiceAccount token expiration because tokens were injected as environment variables at pod startup and never refreshed, even though the operator was refreshing the Secret. Changes: - Operator: Mount runner token Secret as volume instead of env var - Operator: Inject BOT_TOKEN_PATH env var pointing to mounted token file - Runner: Read token from BOT_TOKEN_PATH file on each connection/reconnection - Runner: Fall back to BOT_TOKEN env var for backward compatibility - Runner: Improved error messages for token authentication issues The operator already refreshes tokens every 45 minutes via ensureFreshRunnerToken(). Now the runner can read the refreshed token from the mounted Secret volume without requiring pod restart. Fixes #445 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
| // Prefer annotated secret name; fallback to deterministic name | ||
| secretName := "" | ||
| if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { | ||
| if anns, ok := meta["annotations"].(map[string]interface{}); ok { | ||
| if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { | ||
| secretName = strings.TrimSpace(v) | ||
| } | ||
| } | ||
| } | ||
| if secretName == "" { | ||
| secretName = fmt.Sprintf("ambient-runner-token-%s", name) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@claude any risks to deleting this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses the issue of ServiceAccount tokens expiring in long-running sessions by implementing dynamic token refresh. Instead of reading tokens once from environment variables at startup, the runner now reads tokens from a file path that Kubernetes updates automatically when tokens are refreshed.
- Changes token handling from static
BOT_TOKENenvironment variable to file-basedBOT_TOKEN_PATH - Implements volume mount for runner token secret to enable dynamic token refresh without pod restart
- Maintains backward compatibility by falling back to
BOT_TOKENifBOT_TOKEN_PATHis not set
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| components/runners/runner-shell/runner_shell/core/transport_ws.py | Adds logic to read authentication token from file path (BOT_TOKEN_PATH) with fallback to environment variable (BOT_TOKEN), and improves error messages for authentication failures |
| components/operator/internal/handlers/sessions.go | Mounts runner token secret as volume at /app/runner-token and changes environment variable from BOT_TOKEN to BOT_TOKEN_PATH pointing to the mounted token file |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
components/runners/runner-shell/runner_shell/core/transport_ws.py
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Claude Code ReviewSummaryThis PR addresses service account token expiration in long-running sessions by implementing a volume-based token refresh mechanism. The solution transitions from static environment variable injection ( Overall Assessment: ✅ Approve with minor suggestions The implementation follows established patterns and security standards. The approach is architecturally sound and maintains backward compatibility. Issues by Severity🚫 Blocker IssuesNone - No blocking issues found. 🔴 Critical IssuesNone - No critical issues found. 🟡 Major Issues1. Missing Test Coverage for Token Refresh LogicLocation: Issue: No tests verify the new token volume mount and file reading logic. Impact: Token refresh mechanism is untested. Regressions could cause authentication failures in long-running sessions. Recommendation:
Example test structure: // components/operator/internal/handlers/sessions_test.go
func TestHandleAgenticSessionEvent_RunnerTokenVolumeMount(t *testing.T) {
// Test that Job includes runner-token volume and volumeMount
// Test annotation-based secret name override
// Test fallback to default secret name
}2. Error Handling for Token File Read Should Be More RobustLocation: Issue: File read errors are logged as warnings but don't prevent connection attempts with empty token. Current behavior: try:
with open(token_path, "r", encoding="utf-8") as f:
token = f.read().strip()
except Exception as e:
logger.warning(f"Failed to read token from {token_path}: {e}")
# Continues with empty token, will get 401 laterRecommendation: Consider failing fast if if token_path:
try:
with open(token_path, "r", encoding="utf-8") as f:
token = f.read().strip()
logger.info(f"Read token from {token_path} (len={len(token)})")
except FileNotFoundError:
logger.error(f"BOT_TOKEN_PATH set to {token_path} but file not found")
raise
except Exception as e:
logger.error(f"Failed to read token from {token_path}: {e}")
raiseRationale: If 🔵 Minor Issues3. Token Redaction Missing in New Log StatementLocation: Issue: New log statement logs token path but doesn't redact token length. Current: logger.info(f"Read token from {token_path}")Recommendation (following security-standards.md patterns): logger.info(f"Read token from {token_path} (len={len(token)})")4. Inconsistent Comment Style in Go CodeLocation: Issue: Comment exceeds recommended line length and could be more concise. Current: // Inject BOT_TOKEN_PATH pointing to mounted secret volume
// Token is mounted from runnerTokenSecretName at /app/runner-token
// This allows the runner to read refreshed tokens without pod restartRecommendation: // BOT_TOKEN_PATH points to mounted secret volume for dynamic token refresh
// Mounted from runnerTokenSecretName at /app/runner-token5. Python Docstring Could Be More DescriptiveLocation: Issue: Docstring doesn't mention new token refresh capability. Recommendation: async def connect(self):
"""
Connect to WebSocket endpoint with authentication.
Supports dynamic token refresh via BOT_TOKEN_PATH (file-based)
or static BOT_TOKEN (env var, backward compatibility).
"""Positive Highlights✅ Excellent backward compatibility - Graceful fallback from ✅ Security best practices followed - Token mounted as volume (read-only), follows Kubernetes secret rotation patterns. ✅ Clear error messages - Helpful 401 error hints distinguish between missing token vs. unreadable file. ✅ Follows established patterns:
✅ Type-safe unstructured access - Uses ✅ Proper error handling - Non-fatal errors logged with context, follows error-handling.md patterns. RecommendationsImmediate Action Items (Before Merge)
Follow-up Items (Post-Merge)
Security Review✅ Compliant with Security StandardsToken Handling (security-standards.md):
Container Security:
RBAC:
Architecture AlignmentFollows ADR-0002 (User Token Authentication):
Follows operator patterns (CLAUDE.md:636-771):
Follows error handling patterns (error-handling.md):
Code Quality ChecksBackend/Operator (Go)
Runner (Python)
Final VerdictRecommendation: ✅ Approve after addressing test coverage This PR successfully solves the token expiration problem with a clean, Kubernetes-native approach. The implementation quality is high and follows project standards. Required before merge:
Optional improvements:
🤖 Review completed by Claude Code (Sonnet 4.5) 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryThis PR fixes long-running session token expiration by migrating from environment variable injection to volume-mounted secrets. The change enables dynamic token refresh without pod restarts, addressing a critical operational issue for sessions exceeding ServiceAccount token TTL. Overall Assessment: ✅ APPROVE with minor recommendations The implementation follows established patterns, maintains backward compatibility, and aligns with CLAUDE.md security standards. The approach is architecturally sound and consistent with Kubernetes best practices. Issues by Severity🚫 Blocker IssuesNone - No blocking issues found. 🔴 Critical IssuesNone - All critical patterns followed correctly. 🟡 Major Issues1. Missing Error Handling for Volume Mount FailureFile: The volume mount configuration doesn't handle the case where the secret might not exist yet or is inaccessible. Current: {
Name: "runner-token",
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: runnerTokenSecretName,
},
},
},Recommendation: {
Name: "runner-token",
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: runnerTokenSecretName,
Optional: boolPtr(false), // Fail fast if secret missing
},
},
},This makes the failure explicit rather than waiting for pod startup failure. 2. Token File Read Error Handling Could Be More RobustFile: The token file read logs a warning but continues silently. For long-running sessions, this could lead to subtle auth failures. Current behavior: Logs warning, falls back to empty token if token_path:
max_retries = 3
for attempt in range(max_retries):
try:
with open(token_path, "r", encoding="utf-8") as f:
token = f.read().strip()
logger.info(f"Read token from {token_path}")
break
except FileNotFoundError:
if attempt == max_retries - 1:
logger.error(f"Token file not found after {max_retries} attempts: {token_path}")
raise
await asyncio.sleep(1) # Brief wait for file to appear
except Exception as e:
logger.error(f"Failed to read token from {token_path}: {e}")
raise🔵 Minor Issues1. Potential Race Condition in Annotation ReadingFile: Uses direct map access instead of type-safe unstructured helpers, violating CLAUDE.md rule #4. Current: if annotations := currentObj.GetAnnotations(); annotations != nil {
if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
runnerTokenSecretName = strings.TrimSpace(v)
}
}Per CLAUDE.md (line 361-365):
Recommended: // Use type-safe access
annotations := currentObj.GetAnnotations()
if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
runnerTokenSecretName = strings.TrimSpace(v)
}Note: Current implementation is actually acceptable since 2. Log Message Formatting InconsistencyFile: Uses f-string for path logging, but consider redacting sensitive paths. Current: logger.info(f"Read token from {token_path}")Recommended: # Redact sensitive path info if needed
logger.info(f"Read token from volume mount (path length: {len(token_path)})")3. Missing Volume Mount DocumentationThe PR changes the fundamental token delivery mechanism but doesn't update inline documentation about the new pattern. Recommendation: Add a comment block above the volume mount explaining the token refresh mechanism: // Mount runner token secret as volume for dynamic token refresh
// The backend operator periodically refreshes tokens in this secret,
// allowing long-running sessions to continue without pod restarts.
// Token file: /app/runner-token/k8s-token
{
Name: "runner-token",
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: runnerTokenSecretName,
},
},
},Positive Highlights✅ Excellent Security Practices
✅ Proper Resource Lifecycle Management
✅ Code Quality
✅ Backward Compatibility
✅ Testing ConsiderationsThe change is well-isolated and testable:
RecommendationsPriority 1: Address Before Merge
Priority 2: Consider for Follow-up PR
Priority 3: Nice-to-Have
Pre-Commit Checklist Verification✅ Authentication: N/A (operator code, not user-facing API) Testing Recommendations# Manual testing checklist
1. Deploy with new image
2. Create long-running session (>1 hour)
3. Watch secret for token refresh
4. Verify session continues after token refresh
5. Test backward compatibility with BOT_TOKEN env var
6. Verify proper error messages when token file missing
7. Check operator logs for volume mount errorsFinal VerdictAPPROVED ✅ This is a well-implemented fix that follows project standards and solves a real operational problem. The recommendations above are improvements, not blockers. The code is production-ready with the Priority 1 change applied. Estimated Risk: LOW 🤖 Generated with Claude Code 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Automated Fix by Amber Agent
This PR addresses issue #445 using the Amber background agent.
Changes Summary
Pre-merge Checklist
Reviewer Notes
This PR was automatically generated. Please review:
🤖 Generated with Amber Background Agent
Closes #445