feat(sandbox): code-execution sandbox + runnable artifacts#1727
feat(sandbox): code-execution sandbox + runnable artifacts#1727larryro wants to merge 88 commits into
Conversation
📝 WalkthroughWalkthroughAdds runnable artifact types (python/node), a new Convex execution workflow (sandboxExecutions, mutations, queries), an internal action that calls a sandbox spawner, and a Bun-based spawner service that runs per-call Docker sandboxes. Persists run config on artifacts, introduces artifact_run tool and updates artifact_create/edit, adds UI components to show run state/files, integrates sandbox services into compose/CLI/env/templates, provides a doctor command and tests, and implements network/volume/cleanup helpers and runtime images. Estimated code review effort🎯 5 (Critical) | ⏱️ ~150 minutes ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 22
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tools/cli/src/lib/actions/deploy.ts (1)
47-48:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winInclude sandbox network in dry-run logging for parity.
ensureInfrastructurenow creates two networks, but dry-run only reports one. Add a dry-run line for the sandbox network so operators see the full infra plan.Also applies to: 59-63
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/cli/src/lib/actions/deploy.ts` around lines 47 - 48, Dry-run logging in deploy.ts’s ensureInfrastructure path only reports the internal network; add a matching dry-run log for the sandbox network so the operator sees both networks. Update the logger.info call(s) that currently print `${prefix}Would ensure network: ${getProjectId()}_internal` to also emit `${prefix}Would ensure network: ${getProjectId()}_sandbox` (do this in the earlier dry-run return block and the later dry-run block around lines 59-63), referencing the same prefix, getProjectId() and logger.info used in ensureInfrastructure.services/platform/app/features/chat/components/canvas/canvas-pane.tsx (1)
153-161:⚠️ Potential issue | 🟠 Major | ⚡ Quick winLocalize runnable type labels instead of hardcoding English.
Python (sandbox)andNode (sandbox)should be translation keys so the canvas badge remains localized like the rest of the UI.As per coding guidelines: “No hardcoded user-facing strings; always use the translation hook.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/app/features/chat/components/canvas/canvas-pane.tsx` around lines 153 - 161, Replace the hardcoded English labels for runnable types in TYPE_LABELS so they use the translation hook instead of literal strings: update the entries for the keys "python_runnable" and "node_runnable" in the TYPE_LABELS Record<CanvasContentType, string> to call the i18n translate function (use the existing t hook or import useTranslation/t as used in this component) with distinct keys like "canvas.type.python_runnable" and "canvas.type.node_runnable"; also add those keys (with appropriate localized values) to the locale resource files so the badge text is translated at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@knip.config.ts`:
- Around line 56-57: The knip config's "entry" array currently only lists test
globs ('src/**/*.test.ts'), which can cause live runtime files like
src/server.ts to be treated as unused; update the knip.config.ts "entry" to
include the application's runtime entrypoint (e.g., add 'src/server.ts' to the
entry array) so knip analyzes the live service code when computing usage for the
project.
In
`@services/platform/app/features/chat/components/canvas/canvas-runnable-code-renderer.tsx`:
- Around line 156-157: The current cast using "as RunOutputFile[]" on
artifact?.runOutputFiles can hide runtime shape mismatches; replace it by
implementing a type guard (e.g., isRunOutputFile and isRunOutputFileArray) that
validates each object's required discriminators/fields and use that guard when
assigning outputFiles (instead of the cast) so you assign
artifact.runOutputFiles only when isRunOutputFileArray(artifact.runOutputFiles)
is true, otherwise default to an empty array; add the guard functions near the
component (or in a shared validators file) and reference them from
canvas-runnable-code-renderer.tsx when computing outputFiles.
- Around line 116-124: The component currently renders hardcoded user-facing
strings (e.g., the raw runStatus value and labels like "Run", "Files",
"stdout/stderr", "chars") inside CanvasRunnableCodeRenderer (see use of
runStatus and Badge) — replace those with i18n keys by importing and using the
project's translation hook (e.g., t or useTranslations) and wrap each
user-facing string (including the run status output) with t(...) calls using
appropriate keys; update all occurrences noted (around runStatus rendering and
the label locations at the other listed blocks) so the Badge children and all
labels call t('...') instead of literal text, and ensure keys are descriptive
and added to the locale files.
In `@services/platform/convex/agent_tools/artifacts/shared.ts`:
- Around line 24-33: The current isValidArtifactType function duplicates the
ArtifactType values with a manual OR-chain, risking drift; change it to derive
validity from the canonical artifactTypeEnum instead: import or reference
artifactTypeEnum and implement isValidArtifactType by checking whether value is
a member of artifactTypeEnum (e.g., using
Object.values(artifactTypeEnum).includes(value) or artifactTypeEnum.includes if
it’s an array), keeping the function signature isValidArtifactType(value:
string): value is ArtifactType and ensuring the check uses the enum’s runtime
values so future enum additions are automatically covered.
In `@services/platform/convex/node_only/sandbox/helpers/spawner_client.ts`:
- Around line 234-238: The catch block that wraps the fetch call to fetch(url, {
method: 'POST', headers, body }) currently swallows errors; update the catch to
log the error and context (e.g. URL and any identifying execution id) using
console.warn or console.error instead of leaving it empty, while keeping the
existing comment about best-effort cancellation; ensure the log includes the
error object so failures to POST cancellation requests are recorded for
debugging.
In `@services/platform/convex/sandbox/internal_mutations.test.ts`:
- Around line 57-85: The mock builder in makeBuilder only records status
predicates so tests won’t catch missing org/index predicates; update
builder.withIndex to also capture the index name argument and any org eq()
predicate (push { indexName: _name, field, value } into calls), and change
builder[Symbol.asyncIterator] to verify the captured indexName and org id (e.g.,
find a call with field === 'orgId' and expected value) before selecting
runningRows/queuedRows/completedRows so the fake query fails if
reserveSlotAndInsert drops or uses the wrong org/index filter.
- Around line 118-121: The test mock for internalMutation should be made generic
so the imported handlers keep their types and eliminate the repeated "as unknown
as" casts; update the mocked internalMutation signature to accept a generic type
parameter (e.g., internalMutation<TArgs, TReturn>(handler: MutHandler<TArgs,
TReturn>) or similar) and return a properly typed MutHandler<TArgs, TReturn>,
then adjust the mock implementation to forward types so usages like
reserveSlotAndInsert can be assigned directly (remove casts) — refer to the
mocked function name internalMutation and the test symbols reserveSlotAndInsert,
MutHandler, and baseArgs when making this change.
In `@services/sandbox-egress/Dockerfile`:
- Around line 27-28: Remove the inline HEALTHCHECK directive from the
Dockerfile: delete the HEALTHCHECK block ("HEALTHCHECK --interval=10s
--timeout=3s --retries=2 CMD nc -z 127.0.0.1 3128 || exit 1") so the image no
longer embeds orchestration-specific health policy; keep health checks
centralized in your Compose/service definitions instead (adjust
docker-compose.yml service for the sandbox-egress container to define equivalent
healthcheck there).
In `@services/sandbox-egress/tinyproxy.conf.template`:
- Around line 30-31: The configuration currently sets DisableViaHeader No which
contradicts the hardening intent; change the tinyproxy setting DisableViaHeader
from "No" to "Yes" in tinyproxy.conf.template so the Via header is suppressed on
outgoing requests (i.e., ensure the line reads DisableViaHeader Yes).
In `@services/sandbox-runtime/entrypoint.sh`:
- Around line 72-73: The uv command invocation "uv pip list --format=json
--python /workspace/.deps/python" uses an unsupported --python flag; change the
call to use a supported form such as "uv pip list --format=json" (or if you
intend to capture the current environment's installed packages, use "uv pip
freeze --format=json") so the command succeeds and produces
/workspace/install-report.json instead of failing silently.
In `@services/sandbox/Dockerfile`:
- Around line 32-33: The Dockerfile contains an image-level HEALTHCHECK (the
HEALTHCHECK instruction executing "curl -fsS http://127.0.0.1:8003/health ||
exit 1") which conflicts with the repo convention to centralize health checks in
compose; remove this HEALTHCHECK instruction from the Dockerfile and instead add
an equivalent healthcheck entry for this service in the compose/orchestration
configuration so health logic lives only at compose-level (keep the same
endpoint/interval/timeout/retries semantics when migrating).
- Line 12: Update the sandbox Dockerfile's FROM line to use the Bun 1.3 baseline
(e.g., change the FROM value to oven/bun:1.3-debian or newer) and update the
sandbox package.json `@types/bun` version to match the project's 1.3.11 (ensure
packageManager remains bun@1.3.10); locate the Dockerfile's FROM directive and
the services/sandbox/package.json dependency entry for "`@types/bun`" to make
these changes so the container OS and type definitions align with the project
baseline.
In `@services/sandbox/src/config.ts`:
- Around line 17-22: Remove the "as 'runc' | 'runsc'" cast on the runtime
variable and add a type‑guard to narrow the value via control flow: read const
raw = process.env.SANDBOX_RUNTIME ?? 'runc', implement a function like
isSandboxRuntime(value: string): value is 'runc' | 'runsc' and use it in the
existing check (throw if not valid); after the guard the runtime value will be
narrowed to 'runc' | 'runsc' for use elsewhere (refer to the SANDBOX_RUNTIME
env, the runtime/raw variable and add isSandboxRuntime as the narrowing helper).
In `@services/sandbox/src/server.ts`:
- Around line 80-89: The code adds parsed.executionId into inFlightSet before
validating parsed, which may insert undefined; update the flow around JSON.parse
and inFlightSet so you validate that parsed.executionId is present and a valid
identifier (e.g., non-empty string) before calling inFlightSet.add; if
validation fails, return the 400 Response (same structure used on JSON parse
failure) and do not add to inFlightSet. Target the JSON.parse block and the use
of parsed, ExecuteRequest, parsed.executionId and inFlightSet in server.ts to
implement this guard.
In `@services/sandbox/src/spawn_util.ts`:
- Around line 82-116: The timeout is armed after awaiting collectStdout()/stderr
which means a hung docker prevents the timeout from ever starting; move the
timeout race to start before/while collecting output by creating a single
Promise.race that includes the exited promise, the timeout promise (the existing
timer logic that sets timedOut, calls proc.kill('SIGKILL') and spawns the docker
killer using DOCKER_BIN when opts.killOnTimeoutContainer is set), and the
stdout/stderr collection (collectStdout() and new
Response(proc.stderr).arrayBuffer()); in practice modify the code around
collectStdout(), proc.exited, and opts.timeoutMs so you start the timeout
Promise concurrently (using Promise.race) with collectStdout()/stderr collection
instead of awaiting output first, ensuring timer is set/cleared appropriately
and reusing proc, timer, proc.kill, killOnTimeoutContainer, and Bun.spawn as in
the current logic.
In `@services/sandbox/src/spawn.ts`:
- Around line 250-252: The current timeout cleanup timers call dockerKill (and
similarly dockerRm / rm in the other block) and silently swallow any errors;
update the handlers around setTimeout and the cleanup callers (referencing the
killTimer callback and the other cleanup block that calls dockerRm/rm) to catch
rejections and surface them: log the full error via the service logger (include
containerName/sessionDir context), increment a cleanup-failure metric or
counter, and optionally perform a best-effort retry or escalate (e.g., schedule
a single retry after a short delay) rather than ignoring the failure. Ensure
dockerKill, dockerRm and rm callers use .catch(err => { logger.error(..., err);
metrics.increment('cleanup.failure'); /* optional retry */ }) so orphaned
containers/directories are visible in logs/metrics.
- Around line 220-226: Detect and reject duplicate executionId values before
staging by checking inFlight.has(req.executionId) immediately (before computing
containerName/workspaceHostDir and before creating the AbortController); if the
id already exists, return/throw a clear error to abort the request. Use the
existing inFlight Map (symbol: inFlight) as the guard, and only create and set
the entry ({ containerName, abort }) after the check so you don't overwrite an
existing run; ensure any cleanup paths rely on the presence of that exact
inFlight entry to avoid canceling or deleting another run.
- Around line 130-165: The walker currently explores every entry under outputDir
even after byte limits are reached; modify the async function walk(rel: string)
so it short-circuits traversal as soon as caps are exceeded: at the top of walk
check if totalAccepted >= caps.totalMax (or another combined limit you choose)
and return immediately; before recursing into subdirectories check the same and
skip walking that child; before calling stat/readFile also check per-file and
remaining total bytes to avoid unnecessary I/O; keep using truncatedCount to
record skipped files and ensure files, totalAccepted, truncatedCount updates
remain consistent.
In `@services/sandbox/src/volume.ts`:
- Around line 43-57: The current check-then-create flow using runDocker allows a
race where volume create can fail with "already exists" and is treated as fatal;
update the logic around runDocker calls in this module (the inspect and create
calls in volume.ts) to treat a create failure that indicates the volume already
exists as non-fatal: after running the create command (the variable named
create), if create.exitCode !== 0 inspect create.stderr/create.stdout for the
“already exists” message (or Docker’s equivalent) and, if present, treat the
operation as successful (return) instead of throwing; only throw for genuine
failures that are not the already-exists case.
In `@tests/container-smoke-test.sh`:
- Around line 87-92: The docker network create step can fail if the network
already exists or is in use; make the setup idempotent by checking for the
network before creating or by swallowing non-fatal create errors: use `docker
network inspect tale-sandbox-net >/dev/null 2>&1 || docker network create
--internal --ipv6=false --driver=bridge tale-sandbox-net >/dev/null` or append
`|| true` to the `docker network create ... tale-sandbox-net` call so the script
does not hard-fail when the network already exists or creation returns a
non-zero status.
In `@tools/cli/src/commands/doctor.ts`:
- Around line 119-127: The SANDBOX_TOKEN check in function checkSandboxToken
currently returns status: 'fail' when missing/short which conflicts with the PR
making SANDBOX_TOKEN optional; change the returned status to 'warn' (or, if you
want to keep fail behavior for certain modes, gate the failure by checking the
runtime mode/flag and only set status: 'fail' when that explicit mode requires
the token). Update both occurrences of this logic (the checkSandboxToken
function and the similar check at lines ~175-178) so they return a warning
message and appropriate fix text instead of failing the entire doctor run unless
explicitly required by mode.
In `@tools/cli/src/lib/compose/services/create-sandbox-service.ts`:
- Around line 29-33: The ports mapping currently uses ports: ['8003:8003'] which
binds the spawner to all host interfaces; change this to bind to loopback only
by replacing that entry with an explicit loopback host binding (e.g.,
'127.0.0.1:8003:8003') in the create-sandbox-service definition so the container
port 8003 is published only on localhost and reduces the exposed attack surface.
---
Outside diff comments:
In `@services/platform/app/features/chat/components/canvas/canvas-pane.tsx`:
- Around line 153-161: Replace the hardcoded English labels for runnable types
in TYPE_LABELS so they use the translation hook instead of literal strings:
update the entries for the keys "python_runnable" and "node_runnable" in the
TYPE_LABELS Record<CanvasContentType, string> to call the i18n translate
function (use the existing t hook or import useTranslation/t as used in this
component) with distinct keys like "canvas.type.python_runnable" and
"canvas.type.node_runnable"; also add those keys (with appropriate localized
values) to the locale resource files so the badge text is translated at runtime.
In `@tools/cli/src/lib/actions/deploy.ts`:
- Around line 47-48: Dry-run logging in deploy.ts’s ensureInfrastructure path
only reports the internal network; add a matching dry-run log for the sandbox
network so the operator sees both networks. Update the logger.info call(s) that
currently print `${prefix}Would ensure network: ${getProjectId()}_internal` to
also emit `${prefix}Would ensure network: ${getProjectId()}_sandbox` (do this in
the earlier dry-run return block and the later dry-run block around lines
59-63), referencing the same prefix, getProjectId() and logger.info used in
ensureInfrastructure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 563a2a93-503d-42b2-8301-27c5e0bc0db3
⛔ Files ignored due to path filters (2)
services/platform/convex/_generated/api.d.tsis excluded by!**/_generated/**services/sandbox/bun.lockis excluded by!**/*.lock
📒 Files selected for processing (68)
.commitlintrc.jsoncompose.ymlexamples/agents/chat-agent.jsonknip.config.tsservices/platform/app/features/chat/components/canvas/artifact-bar.tsxservices/platform/app/features/chat/components/canvas/canvas-context.tsxservices/platform/app/features/chat/components/canvas/canvas-pane.tsxservices/platform/app/features/chat/components/canvas/canvas-runnable-code-renderer.tsxservices/platform/app/features/chat/components/message-bubble.tsxservices/platform/convex/agent_tools/artifacts/artifact_create_tool.tsservices/platform/convex/agent_tools/artifacts/artifact_edit_tool.tsservices/platform/convex/agent_tools/artifacts/artifact_run_tool.tsservices/platform/convex/agent_tools/artifacts/shared.tsservices/platform/convex/agent_tools/tool_names.tsservices/platform/convex/agent_tools/tool_registry.tsservices/platform/convex/artifacts/internal_mutations.tsservices/platform/convex/artifacts/internal_queries.tsservices/platform/convex/artifacts/schema.tsservices/platform/convex/crons.tsservices/platform/convex/governance/soft_delete_helpers.tsservices/platform/convex/governance/soft_delete_validators.tsservices/platform/convex/lib/context_management/build_artifacts_context.tsservices/platform/convex/node_only/sandbox/helpers/spawner_client.tsservices/platform/convex/node_only/sandbox/internal_actions.tsservices/platform/convex/sandbox/internal_mutations.test.tsservices/platform/convex/sandbox/internal_mutations.tsservices/platform/convex/sandbox/internal_queries.tsservices/platform/convex/sandbox/output_mutations.tsservices/platform/convex/sandbox/schema.tsservices/platform/convex/schema.tsservices/platform/env.shservices/platform/messages/de.jsonservices/platform/messages/en.jsonservices/platform/messages/fr.jsonservices/sandbox-egress/Dockerfileservices/sandbox-egress/entrypoint.shservices/sandbox-egress/tinyproxy.conf.templateservices/sandbox-runtime/Dockerfileservices/sandbox-runtime/entrypoint.shservices/sandbox/Dockerfileservices/sandbox/Dockerfile.dockerignoreservices/sandbox/package.jsonservices/sandbox/seccomp.jsonservices/sandbox/src/auth.tsservices/sandbox/src/cleanup.tsservices/sandbox/src/config.tsservices/sandbox/src/docker_args.test.tsservices/sandbox/src/docker_args.tsservices/sandbox/src/server.tsservices/sandbox/src/spawn.tsservices/sandbox/src/spawn_util.tsservices/sandbox/src/types.tsservices/sandbox/src/volume.tsservices/sandbox/tsconfig.jsontests/container-image-test.shtests/container-smoke-test.shtools/cli/src/commands/doctor.tstools/cli/src/index.tstools/cli/src/lib/actions/deploy.tstools/cli/src/lib/actions/start.tstools/cli/src/lib/compose/generators/generate-dev-compose.tstools/cli/src/lib/compose/generators/generate-stateful-compose.tstools/cli/src/lib/compose/services/create-convex-service.tstools/cli/src/lib/compose/services/create-sandbox-egress-service.tstools/cli/src/lib/compose/services/create-sandbox-service.tstools/cli/src/lib/compose/types.tstools/cli/src/lib/config/ensure-env.tstools/cli/src/lib/docker/ensure-network.ts
| entry: ['src/**/*.test.ts'], | ||
| project: ['src/**/*.ts'], |
There was a problem hiding this comment.
Add the runtime entrypoint to knip entries.
Line 56 only includes tests as entrypoints; src/server.ts should be listed too, otherwise knip can misclassify live service code as unused.
Suggested fix
'services/sandbox': {
// Standalone Bun HTTP service. Not in root workspaces (own bun.lock);
// declare here so knip can resolve its entry points and ignore them
// from the "unused" sweep.
- entry: ['src/**/*.test.ts'],
+ entry: ['src/server.ts', 'src/**/*.test.ts'],
project: ['src/**/*.ts'],
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| entry: ['src/**/*.test.ts'], | |
| project: ['src/**/*.ts'], | |
| entry: ['src/server.ts', 'src/**/*.test.ts'], | |
| project: ['src/**/*.ts'], |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@knip.config.ts` around lines 56 - 57, The knip config's "entry" array
currently only lists test globs ('src/**/*.test.ts'), which can cause live
runtime files like src/server.ts to be treated as unused; update the
knip.config.ts "entry" to include the application's runtime entrypoint (e.g.,
add 'src/server.ts' to the entry array) so knip analyzes the live service code
when computing usage for the project.
| if (runStatus === 'failed' || runStatus === 'cancelled') { | ||
| return ( | ||
| <Badge | ||
| variant="outline" | ||
| icon={AlertTriangle} | ||
| className="text-destructive border-destructive/40" | ||
| > | ||
| {runStatus} | ||
| </Badge> |
There was a problem hiding this comment.
Externalize run-panel labels and status text through i18n.
These strings are user-facing but hardcoded (Run, Files, stdout/stderr, chars, and raw runStatus). They should go through t(...) keys to avoid locale regressions.
As per coding guidelines: “No hardcoded user-facing strings; always use the translation hook.”
Also applies to: 183-185, 200-202, 211-213, 222-224
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@services/platform/app/features/chat/components/canvas/canvas-runnable-code-renderer.tsx`
around lines 116 - 124, The component currently renders hardcoded user-facing
strings (e.g., the raw runStatus value and labels like "Run", "Files",
"stdout/stderr", "chars") inside CanvasRunnableCodeRenderer (see use of
runStatus and Badge) — replace those with i18n keys by importing and using the
project's translation hook (e.g., t or useTranslations) and wrap each
user-facing string (including the run status output) with t(...) calls using
appropriate keys; update all occurrences noted (around runStatus rendering and
the label locations at the other listed blocks) so the Badge children and all
labels call t('...') instead of literal text, and ensure keys are descriptive
and added to the locale files.
| const outputFiles: RunOutputFile[] = (artifact?.runOutputFiles ?? | ||
| []) as RunOutputFile[]; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Remove the as RunOutputFile[] cast and narrow with a typed validator/guard.
The cast can mask runtime shape mismatches from backend data; prefer a safe narrowing path before rendering.
As per coding guidelines: “Never use ‘as’, never ‘any’, never ‘unknown’; use type guards/generics/discriminated unions.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@services/platform/app/features/chat/components/canvas/canvas-runnable-code-renderer.tsx`
around lines 156 - 157, The current cast using "as RunOutputFile[]" on
artifact?.runOutputFiles can hide runtime shape mismatches; replace it by
implementing a type guard (e.g., isRunOutputFile and isRunOutputFileArray) that
validates each object's required discriminators/fields and use that guard when
assigning outputFiles (instead of the cast) so you assign
artifact.runOutputFiles only when isRunOutputFileArray(artifact.runOutputFiles)
is true, otherwise default to an empty array; add the guard functions near the
component (or in a shared validators file) and reference them from
canvas-runnable-code-renderer.tsx when computing outputFiles.
| export function isValidArtifactType(value: string): value is ArtifactType { | ||
| return ( | ||
| value === 'html' || | ||
| value === 'svg' || | ||
| value === 'markdown' || | ||
| value === 'mermaid' || | ||
| value === 'code' | ||
| value === 'code' || | ||
| value === 'python_runnable' || | ||
| value === 'node_runnable' | ||
| ); |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Derive isValidArtifactType from artifactTypeEnum to avoid drift.
The hand-maintained OR-chain duplicates enum values and can silently fall out of sync on future type additions.
Proposed change
export function isValidArtifactType(value: string): value is ArtifactType {
- return (
- value === 'html' ||
- value === 'svg' ||
- value === 'markdown' ||
- value === 'mermaid' ||
- value === 'code' ||
- value === 'python_runnable' ||
- value === 'node_runnable'
- );
+ return artifactTypeEnum.safeParse(value).success;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/platform/convex/agent_tools/artifacts/shared.ts` around lines 24 -
33, The current isValidArtifactType function duplicates the ArtifactType values
with a manual OR-chain, risking drift; change it to derive validity from the
canonical artifactTypeEnum instead: import or reference artifactTypeEnum and
implement isValidArtifactType by checking whether value is a member of
artifactTypeEnum (e.g., using Object.values(artifactTypeEnum).includes(value) or
artifactTypeEnum.includes if it’s an array), keeping the function signature
isValidArtifactType(value: string): value is ArtifactType and ensuring the check
uses the enum’s runtime values so future enum additions are automatically
covered.
| try { | ||
| await fetch(url, { method: 'POST', headers, body }); | ||
| } catch { | ||
| // Cancellation is best-effort; the watchdog cron will reap stuck rows. | ||
| } |
There was a problem hiding this comment.
Add logging to the catch block per coding guidelines.
The empty catch block silently swallows fetch failures. While the comment explains cancellation is best-effort, logging the error would help debugging stuck executions.
As per coding guidelines: "No empty catch blocks; log with console.warn/console.error or re-throw."
Proposed fix
try {
await fetch(url, { method: 'POST', headers, body });
- } catch {
- // Cancellation is best-effort; the watchdog cron will reap stuck rows.
+ } catch (err) {
+ // Cancellation is best-effort; the watchdog cron will reap stuck rows.
+ console.warn(`[spawner] cancel ${executionId} failed:`, err);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try { | |
| await fetch(url, { method: 'POST', headers, body }); | |
| } catch { | |
| // Cancellation is best-effort; the watchdog cron will reap stuck rows. | |
| } | |
| try { | |
| await fetch(url, { method: 'POST', headers, body }); | |
| } catch (err) { | |
| // Cancellation is best-effort; the watchdog cron will reap stuck rows. | |
| console.warn(`[spawner] cancel ${executionId} failed:`, err); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/platform/convex/node_only/sandbox/helpers/spawner_client.ts` around
lines 234 - 238, The catch block that wraps the fetch call to fetch(url, {
method: 'POST', headers, body }) currently swallows errors; update the catch to
log the error and context (e.g. URL and any identifying execution id) using
console.warn or console.error instead of leaving it empty, while keeping the
existing comment about best-effort cancellation; ensure the log includes the
error object so failures to POST cancellation requests are recorded for
debugging.
| const killTimer = setTimeout(() => { | ||
| void dockerKill(containerName).catch(() => {}); | ||
| }, timeoutMs); |
There was a problem hiding this comment.
Don’t silently swallow kill/cleanup failures.
If dockerKill, dockerRm, or rm fail, this leaves orphaned containers or session directories with no signal in logs or metrics. In a long-running spawner, that turns a recoverable cleanup problem into an invisible disk/slot leak.
Also applies to: 361-366
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/spawn.ts` around lines 250 - 252, The current timeout
cleanup timers call dockerKill (and similarly dockerRm / rm in the other block)
and silently swallow any errors; update the handlers around setTimeout and the
cleanup callers (referencing the killTimer callback and the other cleanup block
that calls dockerRm/rm) to catch rejections and surface them: log the full error
via the service logger (include containerName/sessionDir context), increment a
cleanup-failure metric or counter, and optionally perform a best-effort retry or
escalate (e.g., schedule a single retry after a short delay) rather than
ignoring the failure. Ensure dockerKill, dockerRm and rm callers use .catch(err
=> { logger.error(..., err); metrics.increment('cleanup.failure'); /* optional
retry */ }) so orphaned containers/directories are visible in logs/metrics.
| const inspect = await runDocker(['volume', 'inspect', name]); | ||
| if (inspect.exitCode === 0) return; // already exists, already chowned | ||
|
|
||
| const create = await runDocker([ | ||
| 'volume', | ||
| 'create', | ||
| '--label', | ||
| 'tale.sandbox-cache=1', | ||
| name, | ||
| ]); | ||
| if (create.exitCode !== 0) { | ||
| throw new Error( | ||
| `volume: failed to create cache volume ${name}: ${create.stderr.trim() || create.stdout.trim()}`, | ||
| ); | ||
| } |
There was a problem hiding this comment.
Handle concurrent first-create races for Docker volumes.
Line 43–57 has a check-then-create race. Two callers can both see “not exists”; one volume create will fail with “already exists”, which is currently treated as fatal.
Suggested fix
const create = await runDocker([
'volume',
'create',
'--label',
'tale.sandbox-cache=1',
name,
]);
if (create.exitCode !== 0) {
+ const existsNow = await runDocker(['volume', 'inspect', name]);
+ if (existsNow.exitCode === 0) return;
throw new Error(
`volume: failed to create cache volume ${name}: ${create.stderr.trim() || create.stdout.trim()}`,
);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/sandbox/src/volume.ts` around lines 43 - 57, The current
check-then-create flow using runDocker allows a race where volume create can
fail with "already exists" and is treated as fatal; update the logic around
runDocker calls in this module (the inspect and create calls in volume.ts) to
treat a create failure that indicates the volume already exists as non-fatal:
after running the create command (the variable named create), if create.exitCode
!== 0 inspect create.stderr/create.stdout for the “already exists” message (or
Docker’s equivalent) and, if present, treat the operation as successful (return)
instead of throwing; only throw for genuine failures that are not the
already-exists case.
| docker network rm tale-sandbox-net >/dev/null 2>&1 || true | ||
| docker network create \ | ||
| --internal \ | ||
| --ipv6=false \ | ||
| --driver=bridge \ | ||
| tale-sandbox-net >/dev/null |
There was a problem hiding this comment.
Avoid hard-failing when the sandbox network already exists.
Line 87–92 can fail under concurrent/local runs: if docker network rm is ignored (network still in use), docker network create ... tale-sandbox-net exits non-zero and the script aborts.
Suggested fix
-docker network rm tale-sandbox-net >/dev/null 2>&1 || true
-docker network create \
- --internal \
- --ipv6=false \
- --driver=bridge \
- tale-sandbox-net >/dev/null
+if ! docker network inspect tale-sandbox-net >/dev/null 2>&1; then
+ docker network create \
+ --internal \
+ --ipv6=false \
+ --driver=bridge \
+ tale-sandbox-net >/dev/null
+fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| docker network rm tale-sandbox-net >/dev/null 2>&1 || true | |
| docker network create \ | |
| --internal \ | |
| --ipv6=false \ | |
| --driver=bridge \ | |
| tale-sandbox-net >/dev/null | |
| if ! docker network inspect tale-sandbox-net >/dev/null 2>&1; then | |
| docker network create \ | |
| --internal \ | |
| --ipv6=false \ | |
| --driver=bridge \ | |
| tale-sandbox-net >/dev/null | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/container-smoke-test.sh` around lines 87 - 92, The docker network
create step can fail if the network already exists or is in use; make the setup
idempotent by checking for the network before creating or by swallowing
non-fatal create errors: use `docker network inspect tale-sandbox-net >/dev/null
2>&1 || docker network create --internal --ipv6=false --driver=bridge
tale-sandbox-net >/dev/null` or append `|| true` to the `docker network create
... tale-sandbox-net` call so the script does not hard-fail when the network
already exists or creation returns a non-zero status.
| function checkSandboxToken(env: NodeJS.ProcessEnv): Check { | ||
| if (!env.SANDBOX_TOKEN || env.SANDBOX_TOKEN.length < 32) { | ||
| return { | ||
| name: 'SANDBOX_TOKEN', | ||
| status: 'fail', | ||
| detail: | ||
| 'missing or too short — required for HMAC auth between Convex and the sandbox spawner', | ||
| fix: 'Re-run `tale init` (or set a 64-char hex value manually)', | ||
| }; |
There was a problem hiding this comment.
SANDBOX_TOKEN check severity conflicts with optional-token behavior.
This currently fails the whole doctor run when unset/short, but this PR explicitly makes SANDBOX_TOKEN optional. Downgrade to warn (or gate fail by mode) to avoid false hard failures.
Also applies to: 175-178
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/cli/src/commands/doctor.ts` around lines 119 - 127, The SANDBOX_TOKEN
check in function checkSandboxToken currently returns status: 'fail' when
missing/short which conflicts with the PR making SANDBOX_TOKEN optional; change
the returned status to 'warn' (or, if you want to keep fail behavior for certain
modes, gate the failure by checking the runtime mode/flag and only set status:
'fail' when that explicit mode requires the token). Update both occurrences of
this logic (the checkSandboxToken function and the similar check at lines
~175-178) so they return a warning message and appropriate fix text instead of
failing the entire doctor run unless explicitly required by mode.
| // Dev convention: publish 8003 to host loopback so `bun dev`'s local | ||
| // convex-local-backend (running on the host) can reach the spawner. | ||
| // Matches rag (8001) and crawler (8002). The `tale deploy` generator | ||
| // can omit this for hardened prod deployments — same as those services. | ||
| ports: ['8003:8003'], |
There was a problem hiding this comment.
Bind sandbox port to loopback explicitly.
ports: ['8003:8003'] exposes the spawner on all host interfaces, but the comment says loopback-only. With docker.sock mounted, this widens attack surface unnecessarily.
Suggested fix
- ports: ['8003:8003'],
+ ports: ['127.0.0.1:8003:8003'],📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Dev convention: publish 8003 to host loopback so `bun dev`'s local | |
| // convex-local-backend (running on the host) can reach the spawner. | |
| // Matches rag (8001) and crawler (8002). The `tale deploy` generator | |
| // can omit this for hardened prod deployments — same as those services. | |
| ports: ['8003:8003'], | |
| // Dev convention: publish 8003 to host loopback so `bun dev`'s local | |
| // convex-local-backend (running on the host) can reach the spawner. | |
| // Matches rag (8001) and crawler (8002). The `tale deploy` generator | |
| // can omit this for hardened prod deployments — same as those services. | |
| ports: ['127.0.0.1:8003:8003'], |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/cli/src/lib/compose/services/create-sandbox-service.ts` around lines 29
- 33, The ports mapping currently uses ports: ['8003:8003'] which binds the
spawner to all host interfaces; change this to bind to loopback only by
replacing that entry with an explicit loopback host binding (e.g.,
'127.0.0.1:8003:8003') in the create-sandbox-service definition so the container
port 8003 is published only on localhost and reduces the exposed attack surface.
… proxy, spawner)
Container-side foundation for the `code_run` agent tool: an ephemeral
Python/Node sandbox the LLM can invoke to run code with arbitrary packages
and surface generated files (e.g. .pptx via python-pptx) as chat attachments.
Components:
- services/sandbox-runtime: lean Python 3.12 + Node 24 + uv image. Entrypoint
installs requested packages on demand (`--only-binary=:all:` for pip and
`--ignore-scripts` for npm by default — closes setup.py / postinstall ACE
vectors per R2.7), emits PHASE markers for the chat UI, then execs user
code at /workspace/code/main.{py,js}.
- services/sandbox-egress: tinyproxy sidecar on tale-sandbox-net (an
internal-only Docker bridge). Filters CONNECT host requests against an
allow-list (pypi.org, files.pythonhosted.org, registry.npmjs.org, github
package endpoints). Replaces the originally-planned iptables IP allow-list
which R1.3/R2.1 showed was unsafe due to shared Fastly/Cloudflare CDN IPs.
- services/sandbox: ~250 LOC Bun HTTP service. POST /v1/execute with
HMAC-signed body spawns one ephemeral container; POST /v1/cancel/:id
propagates AbortSignal as docker kill. Workspace is a per-call tmpfs
Docker volume (size=256m, hard ENOSPC cap per R2.2); pip/npm caches are
per-org named volumes (closes the R2.3 cross-tenant wheel-cache poison
vector). docker_args.ts is a pure builder with strict regex validation;
the #1 regression gate per R1.22 has 9 passing unit tests asserting the
argv shape and that user code never reaches argv.
- compose.yml: registers both services and the internal `sandbox` network
pinned to `tale-sandbox-net`. IPv6 disabled on the bridge to prevent
v4-allowlist bypass via v6 routes (R1.3).
- .commitlintrc.json: add `sandbox` scope.
Convex schema, executeCode action, code_run tool, CLI compose generator
work, and tests follow in M2 and M3.
Plan: /home/larry/.claude/plans/presentation-generation-from-prompts-delightful-aho.md
…ine (M2)
Convex-side of the sandbox feature: a new code_run agent tool, the
sandboxExecutions audit table, and the executeCode internal action that
owns the spawner HTTP round-trip + transactional storage uploads.
Schema (services/platform/convex/sandbox/schema.ts):
- sandboxExecutions table — uploadedBy, agentSlug, lifecycleStatus,
statusChangedAt, heartbeatAt, estimatedSeconds/actualSeconds, full
output-files validator, structured errorCode taxonomy (R1.12 + R2.8).
- Indexes: by_organizationId_and_status (quota counting), by_org_user
(GDPR cascade), by_status (watchdog), by_threadId, by_organizationId.
Mutations (services/platform/convex/sandbox/internal_mutations.ts):
- reserveSlotAndInsert — atomic concurrency-cap + daily-CPU-budget +
audit-row insert in one mutation, closes the TOCTOU race R1.8/R1.10
flagged. Uses the same withIndex pattern as video_links/mutations.ts.
- setRunning / heartbeat / finalize.
- recoverStuckSandboxes — watchdog cron flips rows older than
2×max-timeout to failed/SPAWNER_UNAVAILABLE (Convex 30-min hard-kill
skips action try/finally; mirrors recoverStuckTranscriptions pattern).
Node action (services/platform/convex/node_only/sandbox/internal_actions.ts):
- executeCode — reserves slot → resolves+validates inputFiles via internal
query (IDOR check) → setRunning + 60s heartbeat loop → POSTs to spawner
with HMAC-signed body → all-or-nothing storage upload (rolls back blobs
on partial failure) → batched fileMetadata insert → finalize. Per
feedback_no_empty_catch: infra failures throw, user-code failures return
a structured {success:false, errorCode, ...}.
Tool (services/platform/convex/agent_tools/code/code_run_tool.ts):
- code_run — Python 3.12 + Node 24, packages on demand, inputFiles
ref-by-fileMetadataId, allowSdist/allowInstallScripts opt-in overrides,
full errorCode-recovery table in description, tool-selection precedence
vs excel/pdf/docx/image (per R1.15).
Wiring:
- schema.ts registers sandboxExecutions.
- crons.ts adds 'recover stuck sandbox executions (every 5 min)'.
- soft_delete_validators registers 'sandboxExecution'; soft_delete_helpers
maps it to the table + uploadedBy author field.
- tool_registry / tool_names register codeRunTool.
Tests + CLI integration (M3) follow.
NOTE: services/platform/convex/_generated/api.d.ts will regenerate on the
next `bunx convex deploy` / `tale start`; the typescript errors against
internal.sandbox.* in this commit are the documented stale-codegen state
from feedback_api_dts_autogenerated, not bugs.
…(M3) CLI work to deploy the sandbox stack via `tale start` / `tale deploy`: - ComposeService type extended with cap_add, mem_limit, pids_limit, ulimits, security_opt, runtime — previously absent from the generator which silently dropped these on the convex service. - BONUS FIX surfaced by sandbox review (R1.17): create-convex-service.ts was shipping the production convex container WITHOUT NET_ADMIN, so services/convex/docker-entrypoint.sh:79 was silently logging "iptables present but no NET_ADMIN capability — SSRF firewall NOT installed" on every deploy. Apply the missing cap_add + mem_limit + pids_limit + ulimits flags so production deployments finally get the SSRF egress firewall the entrypoint was always trying to install. - New service factories: create-sandbox-service (HTTP spawner, mounts docker.sock, two-network membership) and create-sandbox-egress-service (tinyproxy sidecar on the internal sandbox bridge). - STATEFUL_SERVICES includes 'sandbox' + 'sandbox-egress' so the deploy.ts auto-include-missing-stateful logic picks them up on the next `tale deploy` after upgrade — no migration registry entry needed. - ensureSandboxNetwork() creates `tale-sandbox-net` (fixed Docker name, --internal, --ipv6=false). Called from both start.ts (dev) and deploy.ts (prod) infrastructure setup. - ensure-env: SANDBOX_TOKEN added to requiredVars + secretDefaults (auto-generated 32-byte hex). generateEnvContent emits SANDBOX_TOKEN + SANDBOX_RUNTIME / SANDBOX_EGRESS_ALLOWLIST comment block for operators to override. - New command: `tale doctor` — preflight checks for sandbox host requirements (docker, /var/run/docker.sock, runsc registration with dockerd, userns-remap, AppArmor docker-default, SANDBOX_TOKEN presence). R1.17 surfaced that there was no `doctor` command at all; scope here is intentionally narrow (sandbox-relevant only) to avoid scope creep — future Docker version / disk headroom checks belong here too but separately. CLI typecheck passes (`bunx tsc --noEmit`). Next: tests (M3 test pass) + bun run check until green.
…ient
oxlint flagged 4 issues in M1/M2 code:
- spawner_client.ts:97 — `throw new Error(...)` inside `catch (err)` without
forwarding the original. Added `{ cause: err }` so debugging keeps the
network-error chain.
- internal_actions.ts:445 — same pattern; same fix.
- spawner_client.ts:113 — `await res.json() as SpawnerExecuteResponse`. Annotated
with `oxlint-disable typescript/no-unsafe-type-assertion` because the wire
contract is validated on the spawner side; trusting it here is by design.
- internal_actions.ts:177 — `err.data as { message?: string }`. Same disable,
scoped to the line that runs only after a `'message' in err.data` narrowing.
The remaining lint error (lib/seo/integration.test.ts) predates this branch
and is unrelated to the sandbox work.
) Mocks _generated/server.internalMutation so the real handler is callable with a fabricated ctx (matches the file_metadata/internal_mutations.test.ts pattern). Covers: - Empty in-flight → row inserted with status='queued', lifecycleStatus='active'. - Cap reached (4 running) → throws ConvexError (atomic concurrency cap, closes the TOCTOU race R1.8/R1.10 flagged). - Daily CPU budget pre-debit overflow (4 × 500s prior + 30s requested > 1800s cap) → throws — pre-debit semantics verified, closes R1.10's post-debit overshoot. - recoverStuckSandboxes — only the row whose heartbeatAt is older than 2×max-timeout gets flipped to failed/SPAWNER_UNAVAILABLE. All 4 tests pass via vitest. Combined with the 9-test argv builder gate shipped in M1, that's two of R1.22's five critical regression gates. The remaining three (in-container privilege assertion, fileMetadata IDOR via inputFiles, cancellation propagation) require either a running docker daemon (privilege) or a Convex test harness (IDOR / cancellation); both are integration-test scope and best added when wiring up CI for the sandbox stack.
Five real bugs surfaced by trying to actually hit /v1/execute end-to-end: 1. **Docker CLI API skew (L3 /health 503)** — Debian's `docker.io` 20.10.5 speaks API 1.41; modern daemons require ≥1.44. Switched to the official static CLI via `COPY --from=docker:27-cli` and changed /health from `docker info --format` (which panics in old CLI) to `docker version --format`. 2. **tmpfs Docker volumes aren't shared between docker run calls** — my original multi-helper staging (busybox containers writing to a shared workspace volume) silently failed because each `docker run` of a tmpfs-driver-local volume creates a fresh mount. Tried piping tar to stdin into the runtime container, but then… 3. **`docker cp` can't read from --tmpfs mounts** — verified directly: file present inside container (`docker exec ls`) but `docker cp` returns "Could not find the file ... in container". Switched workspace to a 1:1 host bind mount under /var/lib/tale-sandbox/sessions/<uuid>/. Spawner now stages files via Bun fs (no busybox/tar dance) and harvests outputs the same way. The compose + CLI factory mount /var/lib/tale-sandbox into the spawner container at the same path so the docker daemon (resolving against host fs) and the spawner agree on it. Trade-off: lose the perfect tmpfs ENOSPC cap; keep --ulimit fsize=100m per file + post-run rm -rf. 4. **uv/pip fail under --read-only** — uv writes to $HOME/.cache/uv, nobody's $HOME is /nonexistent which is RO. Set HOME=/tmp (we have --tmpfs /tmp) + UV_CACHE_DIR=/cache/pip. 5. **Cache volume permission denied** — new Docker volumes are root-owned by default; the runtime user is 65534. ensureCacheVolume now detects first-creation and runs a transient busybox to chown the mount point 65534:65534. 6. **Internal-only sandbox network blocks tinyproxy's own outbound** — sandbox-egress couldn't resolve pypi.org because `internal: true` cut all DNS too. Put sandbox-egress on BOTH `sandbox` (where runtime containers reach it) and `internal` (where it has internet for the upstream tunnel). Runtime containers stay solely on sandbox. 7. **Timeout didn't kill the container** — killing the docker CLI process doesn't stop the sibling container; it just disconnects the wrapper. Two-tier timeout now: inner timer issues `docker kill <name>` at timeoutMs; outer (CLI process kill) at timeoutMs+30s as belt-and- suspenders. 8. **Error classifier patterns stale** — uv's "no matching distribution" has become "unsatisfiable"; runtime-time egress denial exits with 1 not 64. Broadened PACKAGE_NOT_FOUND regex; classify EGRESS_DENIED on any exit when stderr matches. 9. **tinyproxy log file root-owned** — entrypoint chowns the log to nobody so tinyproxy (which drops privs) can write to it. L4 verified end-to-end: - python hello world: 620ms - python-pptx (warm cache): 1.18s, real .pptx file with 3 slides - TIMEOUT: exit 137 at 3.27s for a sleep(30) with timeoutMs=3000 - EGRESS_DENIED, PACKAGE_NOT_FOUND, RUNTIME_ERROR: all classified - HMAC mismatch: 401 api.d.ts regen picked up sandbox/* + agent_tools/code/* on the platform restart that happened during testing — included.
The L4 spawner regex was UUID-only (hex + hyphens), but executeCode passes
the Convex audit row's _id (lowercase alphanumeric, base36-ish) as the
spawner executionId. Broaden both regexes to [a-zA-Z0-9_-]{1,64} — still
safe for Docker container names and argv positions, now accepts both
UUIDs and Convex ids.
L5 verified end-to-end: executeCode action via convex run produced a
real .pptx in 2.6s with audit row + fileMetadata row both populated.
…nt demo
- compose.yml + create-sandbox-service: publish 8003:8003 (same shape as
rag/crawler dev convention). `bun dev` runs convex-local-backend on
the host; without this the executeCode action can't resolve `sandbox`
(Docker DNS). The tale-deploy generator can omit the port for hardened
prod deployments — same option as rag/crawler.
- examples/agents/chat-agent.json: add `code_run` to toolNames; flip
Rule 7's "do NOT produce .pptx — there is no PPTX export" line in
EN / DE / FR system prompts to "for downloadable .pptx call code_run
with python-pptx==1.0.2". HTML in-chat slide guidance via
artifact_create is preserved as the in-chat default.
Verified end-to-end via host loopback: HMAC-signed POST to
127.0.0.1:8003/v1/execute produces a real .pptx in 2.3s.
Recipe for bun dev:
1. .env has SANDBOX_TOKEN=<hex> and SANDBOX_URL=http://sandbox:8003
(for the dockerized convex container).
2. services/platform/.env.local has the same SANDBOX_TOKEN and
SANDBOX_URL=http://127.0.0.1:8003 (sync-convex-env-from-dotenv
applies higher priority for .env.local; bun dev's local convex
backend picks up the loopback URL).
3. `bun dev` from services/platform/.
When code_run finished it left the file as a fileMetadata row but never
attached it to the assistant message — so the LLM said "your pptx is
ready" and nothing appeared in the chat bubble. Mirror what excel/pdf/
docx tools do:
- insertOutputFiles + executeCode now thread storageId through to the
tool layer (was only returning fileMetadataId).
- code_run tool calls appendFilePart(ctx, {fileName, mimeType,
downloadUrl}) per output file, with downloadUrl built via
buildDownloadUrl(storageId, name) — same helper excel_tool uses.
Verified: re-ran executeCode directly via convex run; the response now
carries storageId for each file. The chat-bubble attachment fires when
the tool is invoked through the agent loop (next chat send should show
the .pptx card alongside the LLM's text response).
bun dev surfaced this on first invocation:
Uncaught Error: SANDBOX_TOKEN env var is required for sandbox/code_run;
set it in .env
at handler (.../convex/node_only/sandbox/internal_actions.ts:438:8)
bun dev's convex-local-backend runs on the host with whatever env it
gets from .env / .env.local. The hard throw in getSpawnerToken() turned
"forgot to set the secret" into "tool is dead." rag (8001) and crawler
(8002) both sit on the same internal Docker network with no auth and
just work; sandbox should match.
Auth is now opt-in. Both sides agree:
unset on both -> unsigned requests accepted; one-time boot
warning on the spawner
set on spawner only -> 401 (catches client/server config drift)
set on client only -> harmless (client signs, server ignores)
set on both -> HMAC required, mismatch = 401
tale init still auto-generates SANDBOX_TOKEN so prod stays HMAC-on by
default; this only removes the hard-error path when the secret happens
to be missing at runtime.
Files (4):
- services/sandbox/src/types.ts sandboxToken: string | null
- services/sandbox/src/config.ts drop requireEnv; treat "" as unset
- services/sandbox/src/server.ts gate verify() on token !== null;
warn once on boot in unauth mode
- spawner_client.ts drop throw; omit signature header
when token is null
Verified:
- spawner with SANDBOX_TOKEN unset boots, logs the warning, accepts an
unsigned POST and runs python in 482ms.
- spawner with SANDBOX_TOKEN set still returns 401 on bad/missing sig.
- 9 argv unit tests still pass.
…r bun dev)
User's `bun dev` action threw `fetch failed: sandbox:8003` because the
client defaulted to the Docker DNS name. From the host, that doesn't
resolve.
The convention rag and crawler follow is the inverse:
- Code default = http://localhost:<port> (works for bun dev with
published ports)
- services/platform/env.sh sets the docker DNS as the default for
in-container processes, e.g. RAG_URL="${RAG_URL:-http://rag:8001}".
The platform docker-entrypoint sources env.sh and then convex env
sets the value into the convex backend, so dockerized actions see
the docker name.
Two changes:
- spawner_client.getSpawnerUrl() defaults to http://localhost:8003.
- services/platform/env.sh adds SANDBOX_URL="${SANDBOX_URL:-http://sandbox:8003}"
next to RAG_URL / CRAWLER_URL / SEARCH_SERVICE_URL.
Net effect:
bun dev (host node) → code default localhost:8003 → works
zero-config via the published port
dockerized convex (in compose) → env.sh default sandbox:8003 → works
operator override (either) → SANDBOX_URL in .env / .env.local /
docker compose environment block
takes precedence
Sandbox argv unit tests (9) still pass.
Foundation for folding code_run into the artifact system (plan
Refinement 2). The sandbox execution layer is unchanged behaviorally;
the wire protocol grows phase events so the canvas can show live
progress, and the artifacts schema grows run-state fields so a runnable
artifact's row holds its current execution state.
Schema (artifacts/schema.ts):
- type union gains `python_runnable` + `node_runnable`
- run-state fields (all optional — non-breaking per
feedback_deprecate_dont_delete_schema_fields):
runPackages, runOptions, runStatus (queued|installing|running|
completed|failed|cancelled), runProgress, runStartedAt, runCompletedAt,
runExitCode, runErrorCode, runErrorMessage, runStdoutPreview/StorageId,
runStderrPreview/StorageId, runOutputFiles, runExecutionId
Mutations (artifacts/internal_mutations.ts):
- initArtifactRun: set runStatus='queued' + clear prior-run remnants
- patchArtifactRunProgress: mid-flight status/progress updates fired
on PHASE events from the spawner
- finalizeArtifactRun: write final exit code + output files + clear
the live progress string
Spawner (services/sandbox/src/):
- spawn_util.runDocker now accepts onStdoutChunk callback that fires
per chunk during the docker run (vs after exit).
- spawn.executeRequest accepts ExecuteRequestOptions.onPhase and runs
a line-buffered parser over the stdout chunks; PHASE: installing /
PHASE: running lines fire the callback with {phase} events.
- server /v1/execute switches from buffered JSON response to SSE:
emits 'event: phase data: {...}' per phase + final
'event: result data: {...full response...}'.
Verification:
- 9 argv unit tests still pass.
- bun run typecheck clean.
- curl smoke against rebuilt container: SSE stream emits two phase
events ('installing', 'running') then the final result event with
a real .pptx in outputFiles.
M5b will switch the convex spawner_client to a streaming consumer and
wire artifact_create / artifact_edit to call executeCode for runnable
types.
…r runnable types (M5b)
Builds on M5a's schema + SSE foundation. The agent-facing tool surface
is now the artifact pair (create/edit); the sandbox spawner stays
unchanged and is invoked transparently when an artifact's type is
runnable.
- spawner_client.spawnerExecute now consumes the SSE stream from the
spawner. Phase events fire an optional onPhase callback; the final
`event: result` payload is returned as the same SpawnerExecuteResponse
shape callers had before (drop-in replacement; signature gained an
optional `callbacks` arg).
- spawner_client also exposes an SSE event parser that tolerates partial
reads and chunk boundaries.
- executeCode internal_action gains optional `artifactId`. When set:
- onPhase fires patchArtifactRunProgress with a human-readable
progress string ("Installing python-pptx" / "Running code") so the
canvas runnable-code-renderer can subscribe and show live state.
- On success, finalizeArtifactRun writes runStatus=completed plus
exit code, stdout/stderr previews, and runOutputFiles. The audit
sandboxExecutions row still gets its own forensics; the artifact
row holds the latest result for fast canvas reads.
- artifact_create_tool:
- shared.ts adds runnable types to the enum + isRunnableArtifactType /
runnableLanguage helpers.
- input schema gains optional `packages`, `allowSdist`,
`allowInstallScripts`, `timeoutMs` (gated semantically on runnable
types).
- execute(): after the canonical content settle, runnable types
call initArtifactRun then executeCode with the new artifactId so
the spawner streams progress straight to the artifact row.
- artifact_edit_tool:
- both patch and rewrite success branches call a new local
maybeRerun() helper. For runnable types this reloads the row to
pick up runPackages/runOptions captured at create time, fires
initArtifactRun to clear prior-run remnants, then re-invokes
executeCode with the new content.
- LLM can iterate via small patches without re-emitting the full
script; canvas subscribes to the same artifact row and updates.
The `code_run` standalone tool still exists; M5c will remove it and
update the demo agent / system prompt to point at the unified path.
…/ node_runnable (M5c) remove the standalone code_run tool in favor of two new artifact types (python_runnable, node_runnable) so sandbox code execution shares the artifact tool surface: live source streaming in the canvas pane, patch / rewrite edit semantics for multi-turn refinement, and LLM context-aware iteration via build_artifacts_context. - canvas: new canvas-runnable-code-renderer with source on the left and status chip + file chips + collapsible stdout/stderr on the right; canvas-pane routes python_runnable / node_runnable to it - context: build_artifacts_context surfaces runStatus / runErrorCode / runOutputFiles attributes for runnable types so follow-up turns pick the right next action (patch on failure, leave alone on completion) - agent: chat-agent.json EN/DE/FR rule 7 points at artifact_create with type "python_runnable" instead of code_run - registry: drop code_run from tool_names / tool_registry and remove the tool source file
artifact_create for python_runnable / node_runnable types was crashing the chat with "Element type is invalid: ... got: undefined" because ArtifactBar's TYPE_ICONS and message-bubble's ARTIFACT_PILL_ICONS only had entries for the original five canvas types. tsc didn't catch this because the artifact row reaches them as `any` through useQuery. also store storageId alongside fileMetadataId on artifact runOutputFiles so the canvas right-pane file chip can build a download url without a second roundtrip (api.file_metadata.queries.getById doesn't exist).
…tool results
before this fix, artifact_create and artifact_edit awaited executeCode for
runnable types but discarded the entire run outcome — they only returned
{success: true, artifactId, revision, message} to the llm. so a python
script that exited with RUNTIME_ERROR still showed up as a successful
tool call and the llm happily told the user "文件已生成" while no file
existed.
now both tools forward the run outcome to the llm for runnable types:
runStatus, runExitCode, runErrorCode, runErrorMessage, runStdoutPreview,
runStderrPreview, durationMs, files[], executionId. tool-level success
is redefined for runnable types as (runStatus === 'completed' && files
.length > 0) so the llm can branch correctly. the tool description gains
an errorCode recovery table and an explicit "never say file generated
unless success === true && files.length > 0" guardrail. chat-agent rule
7 EN/DE/FR gain the same checkpoint.
canvas right pane and build_artifacts_context were already showing the
failure state correctly — this aligns the llm's reply with what the user
sees in canvas.
…nderer users reported that after a python_runnable artifact finishes, the download chip and run status were nowhere to be seen on the canvas. root cause: the canvas-pane keeps `showStreamingSource=true` for 10s after stream start (MIN_SOURCE_VIEW_MS dwell window), and during that window the source-only branch runs first — it knew nothing about runnable types and just rendered raw source via CanvasCodeRenderer, bypassing the runnable renderer with its file chips / status panel. route runnable artifacts to CanvasRunnableCodeRenderer regardless of the streaming-source dwell. the renderer now also handles its own `isStreaming` so the source view still renders with the streaming caret during the source phase, while the run panel updates reactively from the artifact row. additionally, the runnable renderer stacks at narrow canvas widths. moved the execution panel ON TOP of the source for narrow canvases (default 480px, below md breakpoint) so the file chip is visible immediately. on wide canvases (≥768px) the side-by-side layout is kept.
…p md: layout switch the responsive md:flex-row split was based on viewport width, not canvas pane width, so a typical 1536px viewport with a 480px canvas triggered md: mode and tried to side-by-side the panel — but the panel's md:w-80 (320px) plus a flex-1 source ate the whole canvas and squeezed the run panel out of layout (height=0, width=0 confirmed via getBoundingClientRect). simpler: always stack execution panel on top, source below. canvas pane max width is 900px anyway, so side-by-side was cramped even when it did fit. tailwind container queries would be the responsive answer but a simple stack is fine here. also add the canvas.runDone i18n key in en/de/fr that the renderer was referencing (was leaking the raw key into the ui).
… the race to execute artifact_create's onInputDelta inserts a placeholder row (toolCallId recorded, liveStreamMode='create', empty content) and only sets state.artifactId AFTER the mutation roundtrip returns. when the model finishes streaming the input fast enough, the AI SDK fires `execute` before that mutation roundtrip lands — state.artifactId is still undefined, so execute falls through to the createArtifact else branch and inserts a SECOND row with full content. result: two artifacts for one tool call, one empty + one settled, showing up in the artifact bar as two duplicate-titled v1 tabs. defensive fallback: in execute's else branch, before inserting a new row, look up any in-flight create-placeholder for this toolCallId scoped to org+thread. if found, finalize THAT row instead of inserting fresh. the new internal query findStreamingPlaceholderByToolCallId is scoped tight enough that it can never claim a settled or cross-conversation artifact. no schema change. lookup runs only on the rare fallback path so the extra (indexless) toolCallId filter inside the org-and-thread index walk is fine — the thread's recent artifact set is small.
…n tool artifact_create and artifact_edit for python_runnable / node_runnable now behave like static types — they only write / patch the source row. they no longer await executeCode, no longer surface a run outcome, no longer auto-run. the new artifact_run tool is the explicit, llm-driven trigger to execute the script: it reads the row's persisted runPackages / runOptions, calls executeCode, and returns the full run outcome (runStatus, runErrorCode, runStderrPreview, files[], executionId, ...). why split it out: real chat smoke showed the llm keeping calling artifact_create on a fresh thread when a run failed — three duplicate v1 tabs of "大熊猫介绍 PPTX" accumulating, never invoking artifact_edit. the unified semantics conflated "author source" with "execute it" and the natural llm recovery from a failed run was to create a new artifact. making execution its own tool decouples the two decisions so the llm's next-action decision tree narrows correctly: - need to make a deck → artifact_create (no auto-run) - need to run it → artifact_run(artifactId) - run failed → artifact_edit(artifactId, patch) → artifact_run again chat-agent.json rule 7 (EN/DE/FR) is rewritten to walk this 4-step cycle and to explicitly forbid calling artifact_create twice for the same request. no schema change, no spawner change, no executeCode action change, no canvas change. the toolCallId race-recovery from a18b5eb stays. runPackages / runOptions are still persisted on the row via the existing initArtifactRun mutation, called once at create time so artifact_run picks them up automatically.
before this patch: - during artifact_create streaming, the canvas right panel showed a bare "RUN" header with nothing beneath it (no status, no files, no errors) because the artifact row has no run state yet — looked broken. - after artifact_create settled, the panel filled with a stale "queued" status that never resolved, because artifact_create called initArtifactRun which forces runStatus='queued' as a side-effect of persisting the run config — but with the split flow no run actually gets queued until artifact_run is invoked. split initArtifactRun into two: - new setArtifactRunConfig — only persists runPackages / runOptions (called by artifact_create after the source settles). - existing initArtifactRun — persists config AND resets runStatus to 'queued' (called by artifact_run immediately before executeCode, same as today). canvas renderer now hides the execution panel entirely while runStatus is undefined and no prior-run artefact (files / stderr / errorCode) is present. once artifact_run kicks off, the panel re-appears with live status. completed-state artifacts still show the panel + file chips as before.
…t duplicate rows
artifact_create's onInputDelta was running this guard:
if (!state.rowInitialized && ...) {
const inserted = await ctx.runMutation(createArtifact, {...});
state.artifactId = inserted.artifactId;
state.rowInitialized = true; // ← set AFTER the await
...
}
the AI SDK dispatches input-deltas back-to-back without waiting for the
prior handler's promise to settle. when two deltas arrive while the
first createArtifact mutation roundtrip is still in flight, BOTH delta
handlers observe `state.rowInitialized === false`, both pass the guard,
and both insert placeholder rows. user ends up with two `v1` tabs of
the same title in the artifact bar — one empty, one finalized — for a
single observed artifact_create tool call.
flip `state.rowInitialized = true` synchronously immediately after the
guard passes, before awaiting the mutation. the second delta now sees
the guard closed and skips the insert. supersedes the toolCallId
race-recovery from a18b5eb for this particular case (that fix was
designed for a different race — state.artifactId being undefined when
execute fires; this is the simpler in-handler check-then-act). the
race-recovery in execute stays for the residual case of execute firing
before the placeholder mutation lands.
- Break a self-referential Convex type in `executeCode` that resolved to
`any` and cascaded 130+ implicit-any errors across the platform: type
the handler return explicitly and use `Id<'sandboxExecutions'>` instead
of `Awaited<ReturnType<typeof ctx.runMutation<...>>>`.
- Replace narrowing `as` casts in new sandbox code with type guards
(artifact-type set, SSE phase, JSON parse) so oxlint stops flagging
them.
- Register `services/sandbox` as a knip workspace; drop `export` from
helpers only used in their own file; delete the unused
`removeVolume` / `harvestOutput` / `CancelResponse` exports.
- Pre-create `tale-sandbox-net` in the smoke-test script (it's external
in compose.yml; the CLI owns its lifecycle, the smoke script didn't).
Tear it down in cleanup.
- Anchor the service-image grep in the smoke + validate-images scripts
to `/tale-${service}:` so `db` no longer matches `tale-sandbox-egress`
via the `db` substring in `sandbox`.
- Suppress AVD-DS-0002 on the two new Dockerfiles that legitimately run
as root (sandbox spawner needs the docker socket; sandbox-egress
entrypoint chowns the log before tinyproxy drops privs).
The previous attempt used inline `# trivy:ignore:AVD-DS-0002` comments
before `FROM` in services/sandbox{,-egress}/Dockerfile, but the misconfig
scanner reports these whole-file rules at line 1 and does not honor
adjacent-line ignores for them — the alerts kept firing on the PR.
Switch to `.trivyignore.yaml` with per-path scoping so each suppression is
narrowly targeted and documented (statement field), and wire the file into
the security.yml Trivy step via `trivyignores:`. Per-path beats `.trivyignore`
plaintext, which would mask the same finding across every Dockerfile.
Also suppress AVD-DS-0026 (no HEALTHCHECK) on sandbox-runtime: it is an
ephemeral one-shot image — the spawner runs it per code_run call, the
entrypoint executes user code, and the container exits.
Image-level scans in build.yml run `vuln,secret` by default (no misconfig),
so no change is needed there.
d906dd0 to
cc5514a
Compare
Closes ~all P0/P1 findings from the two-round /tmp/sandbox-review pass.
Security / control plane
- create-sandbox-service.ts binds the spawner to 127.0.0.1 (was 0.0.0.0)
— docker.sock + unsigned-token would have been remote-root.
- server.ts now exits at boot when SANDBOX_TOKEN is unset, unless the
operator sets SANDBOX_ALLOW_UNAUTH=true. ensure-env auto-mints the
token; tale doctor's required-token check matches the runtime contract.
- server.ts caps /v1/execute body at 256 KB, validates executionId
before registering it, and the cancel route regex now matches the
Convex doc-id alphabet (was UUID-only). req.signal aborts now drive
cancelExecution so a closed SSE stream tears the container down.
- seccomp.json was a 3-line stub that nothing wired up; deleted.
Egress
- tinyproxy.conf.template gains Allow ACLs + DisableViaHeader Yes +
AnonymousHeader stripping XFF/Forwarded/Via/From.
- entrypoint.sh installs iptables REJECT rules for 169.254.169.254 and
RFC1918 ranges (mirrors services/convex/docker-entrypoint.sh).
- Dockerfile + healthcheck switched to a CONNECT probe that fails when
the allowlist is broken; cap_add: [NET_ADMIN] added.
- Egress runs on sandbox + internal — internal provides the NAT path
(tale-sandbox-net is --internal) without weakening the hostname-only
exposure (peers can already reach pypi directly via their own NAT).
Spawner runtime correctness
- cancelExecution wraps dockerKill in a 5s timeout with SIGKILL
fallback so a wedged daemon never hangs the HTTP cancel response.
- SIGTERM handler stops accepting requests, aborts in-flight executions,
and drains for up to 20s before exit (was process.exit(0)
unconditionally — was leaking /var/lib/tale-sandbox/sessions/<id>/).
- Dead volume sweep removed (it filtered tale.sandbox=1 on volumes that
never exist; per-execution storage is a host bind mount, cache
volumes are tale.sandbox-cache=1 and must NOT be reaped). Replaced
with a host-dir mtime sweep over the session root.
- stdout phase parser drains lineBuf on EOF; the final unterminated
PHASE: line no longer leaks into user-visible stdout.
- Six empty catch() blocks across spawn.ts / spawn-util.ts /
spawner_client.ts / internal_actions.ts replaced with console.warn
+ cause (AGENTS.md no-empty-catch rule).
- numEnv rejects whitespace-only and negative values; SANDBOX_PORT
range-checked.
- Filenames in services/sandbox/src/ migrated to dash-case
(docker-args.ts, spawn-util.ts) per AGENTS.md.
Convex state machine
- sandbox/internal_mutations.ts: finalize early-returns when the row
is already terminal (closes the watchdog-vs-action race that was
clobbering SPAWNER_UNAVAILABLE audit data); setRunning is monotonic
across queued → installing → running; watchdog sweeps queued AND
installing AND running rows past the cutoff (a throw between
reserveSlotAndInsert and setRunning no longer leaks a quota slot
forever).
- node_only/sandbox/internal_actions.ts: one failExecution helper
consolidates the six prior failure exits. It finalizes the audit row
AND finalizeArtifactRun (so the canvas spinner can no longer hang
forever) AND cascade-deletes any _storage blobs uploaded for the
doomed run. INPUT_REJECTED replaces the misleading
SPAWNER_UNAVAILABLE code for IDOR-rejected input paths. The
AbortController is now actually .abort()ed in finally.
- artifacts/internal_mutations.ts: patchArtifactRunProgress and
finalizeArtifactRun gain terminal-status guards mirroring the audit
side. initArtifactRun no longer hard-patches runPackages /
runOptions — the artifact_run tool's per-call overrides stay
per-call (was contradicting its own documentation).
Wire-shape consolidation
- New convex/sandbox/wire.ts is the single source of truth for
error-code / run-status / phase / output-file validators + literal
arrays (pattern matches convex/tts/error_codes.ts). sandbox + artifact
schemas + internal mutations + action + spawner_client all import
from it. The 9-member errorCode union previously copied 6 times
is now defined once; INPUT_REJECTED added.
- services/sandbox/src/wire.ts is the spawner-side mirror (the spawner
cannot import from Convex). Centralized ID alphabet regex shared by
server.ts / spawn.ts / docker-args.ts / volume.ts.
- runProgress changes from a server-authored English string to a
structured { kind, package?, version? } shape so the UI can render
localized text via ICU. en/de/fr message files gain the full
chat.canvas.runStatus.* / runErrorCode.* / runProgress.* /
typeLabel.* key set; keys-dynamic.txt registers the four enum-driven
prefixes.
Agent tool surface
- artifact_create_tool deletes the ghost timeoutMs field (artifacts
schema has no runTimeoutMs column; was silently dropped) and the
unused ArtifactCreateRunOutcome interface.
- artifact_run_tool removes the inlined Chinese description example
per feedback_no_locale_terms_in_tool_descriptions; the try/catch
now wraps runQuery (where v.id() throws) instead of toId (which is
a pure cast).
- Stale code_run literals across services/sandbox/package.json,
sandbox-runtime/Dockerfile, doctor.ts, ensure-env.ts replaced with
artifact_run.
Soft-delete unregistration
- sandboxExecution is audit-only; the half-wired soft-delete plumbing
(no by_org_lifecycleStatus index, not in TRASH_VISIBLE_RESOURCE_TYPES,
no fetchTrashSubpage case, no retention cleanup, no storage cascade)
is removed cleanly along with the lifecycleStatus field and the
unused by_org_user / by_threadId indices.
UI / i18n
- canvas-runnable-code-renderer routes every user-visible string
through useT('chat'). The renderer reads runProgress.kind/package/
version and lets ICU format the localized text.
- New canvas/icon-map.ts consolidates the per-artifact-type Icon
Record that lived in three places (artifact-bar.tsx, canvas-pane.tsx,
message-bubble.tsx) with a drift between them; TYPE_LABEL_KEYS,
TYPE_EXTENSIONS, TYPE_MIME_TYPES also moved.
CI / images / smoke
- build.yml + cleanup-pr-images.yml matrices add sandbox, sandbox-egress,
and sandbox-runtime — none of which were built before this commit
(first tale deploy would have failed at the first artifact_run call
with image-not-found).
- spawner calls ensureImage(cfg.runtimeImage) at boot, with retry.
- tests/container-image-test.sh and container-smoke-test.sh include
the three new images; the smoke now exercises a signed
POST /v1/execute end-to-end probe.
- sandbox-runtime/entrypoint.sh stops writing the write-only
install-report.json + install-stdout.log files.
Build / test loop
- services/sandbox added to root workspaces (was outside bun run check
reach). Renamed to @tale/sandbox; lint/format scripts added.
- Dockerfile multi-stage with literal bun.lock + --frozen-lockfile
--production; non-root user where docker.sock permits.
- New services/sandbox/src/server.test.ts (21 tests) covers
cancel-route regex, loadConfig fail-closed defaults, HMAC verify.
- knip.config.ts entry for services/sandbox now anchors on src/server.ts.
Verification
- bun run check: 44/44 tasks, 70 661 platform tests pass, knip clean.
- services/sandbox: 21 tests pass.
Plan: /home/larry/.claude/plans/plan-issues-glistening-peach.md
Review reports: /tmp/sandbox-review/round{1,2}/
The recent multi-file artifact refactor (511e6b3) gave artifacts a files[] map but `artifact_run` could still only execute `entryFile`. LLMs following multi-script skills like Anthropic's pptx (generator + validator) were forced to create separate artifacts per script — exactly what the multi-file model was meant to prevent. This change lets one artifact hold many runnable scripts; the LLM picks which to execute via `artifact_run({path})`, and all sibling files are staged on disk so the executed script can `import` / `require` / `subprocess` them. Wire-shape changes (spawner + platform updated atomically): - ExecuteRequest gains optional `files: SandboxFile[]` + `entryPath: string` - `code` still required and carries the executed file's content; old spawners ignore the new fields and keep working (cross-deploy compat) - Spawner-side validate-request.ts enforces path safety (max 50 files, ≤800KB aggregate, POSIX + ASCII allowlist, no traversal, NUL, BiDi, hidden dotfiles, case-insensitive uniqueness) - New shared regex/caps in sandbox wire.ts (FILE_PATH_SEGMENT_RE, MAX_FILES_PER_REQUEST, MAX_FILE_PATH_LENGTH, MAX_FILES_BYTES) - stageWorkspace writes every file at /workspace/code/<path> with resolve+prefix check (defense in depth), then mirrors the executed content to main.{py,js} — runtime entrypoint unchanged Platform side: - executeCode action passes through `files`/`entryPath` to the spawner - spawner_client SpawnerExecuteBody mirrors the new fields - artifact_run_tool accepts `path: string?` (defaults to entryFile), validates via the platform's 16-rule validatePath, refuses on file_missing / empty_file with structured messages, forwards every file from resolveArtifactFiles to the sandbox - Tool description: explicit "one artifact, many runnable files" guidance with pptx-style example, plus an explicit note that each run gets a fresh container (no cross-run workspace persistence) — validators must be invoked from the generator via subprocess/import, not as a separate artifact_run call Path validation runs in 3 layers: platform tool boundary, spawner request validator, spawner staging loop (resolve+startsWith on codeDir). Verification: `bun run check` clean on @tale/platform (typecheck + lint) and @tale/sandbox (typecheck + lint, modulo the 6 pre-existing lint errors on main in validate-request.ts that are unrelated to this change).
cancelVideoLink early-returned for completed/failed rows, leaving them unbound and re-emitted by listForUserUnboundChat on next page load. The X button's local hideJobIds set is React state and dies on refresh, so the chip kept reappearing. Now flips status='skipped' for any non- skipped row; cleanup's existing messageBoundAt guard keeps bound rows safe.
…reaming Three connected canvas-pane fixes driven by the PPT-generator flow (main.js + verify.js), where running a verifier wiped the generator's download chip and `artifact_create` waited until execute() to show any content. 1. `artifact_create` placeholder-row streaming. onInputStart/onInputDelta parse the partial tool input; once (type, title) are committed, beginCreateStream inserts a revision-0 placeholder row with liveStreamMode='create' + toolCallId. The canvas opens immediately and the existing tool-input-delta -> extractContentDelta path lights up the source view token-by-token. execute() settles via finalizeCreateStream (revision 1 + artifactRevisions row + clear streaming flags), or hands off to createArtifact on collision / type_mismatch. cleanupStaleStreams now deletes revision-0 rows so a crashed onInputDelta does not leak empty artifacts. 2. Multi-file sidebar. New CanvasFileSidebar renders the project's `files[]` (synthesized for legacy rows via resolveArtifactFiles), with an entry-file badge and a pulse dot on the file the LLM is currently writing. Active file lives in CanvasContext; canvas-pane drives renderer / editor / download from the selected file. Source- view mode now triggers only when the streaming path matches the active file (with a fallback to entryFile for legacy streams that don't set streamingPath). 3. Per-file run results. sandboxExecutions records the executed file path. New getLatestRunPerFile query keys run-state UI by activePath so verify.js failing no longer clobbers main.js's runOutputFiles — the .pptx download chip stays reachable even when the user switches to a sibling that produced no output. Output chips are shown for stale runs too (the artifact may have been edited since); the run- status badge still gates on freshness.
…nerating
cancelGeneration used to overwrite the assistant message with a flat
`{ role, content: <string> }` patch. Because the SDK derives UIMessage
parts from message.content at read time, this wiped every non-text part
(file/image cards, reasoning, tool-call, tool-result) the user had
already seen. It also discarded persisted deltas whenever the freeze
snapshot was empty (e.g. mid-remount), turning the bubble into an
aborted shell.
Change the cancel protocol to send displayedLength (a position in chars)
instead of a content string. The backend now truncates the existing
content in place via truncateAssistantContent — text parts cut by
cumulative length, every non-text part preserved at its original index.
When no length is captured but text was already streamed, mark
status=success without touching content rather than vaporising it; only
truly empty messages still get status=failed.
…pt runs
- Spawner generates a wrapper main.{py,js} that subprocess-invokes each
requested step path in sequence inside the same container. Per-step
{path, exitCode, durationMs, status} land in /workspace/output/.tale-steps/
results.json; spawner reads + filters them from the output harvest.
- ExecuteRequest{code?, steps?} are mutually exclusive at the spawner
validator. Step paths must reference files[]; main.{py,js} is reserved
for the wrapper. Cap MAX_STEPS_PER_REQUEST=10.
- sandboxExecutions.steps[] persists per-step results for audit; finalize
mutation forwards the array when present.
- artifact_run tool description rewritten: workspace lifecycle (fresh
every run, /workspace/output/ does not carry across), steps[] is now
the recommended way to chain generator+validator instead of splitting
into two artifact_run calls. path/steps superRefine mutex.
- Result message attributes failures to the specific step
("Step 2/3 (\"validate.py\") exited 1") so the LLM patches the right
file instead of restarting from scratch.
- 8 new spawner tests cover mutex, missing files[], reserved-name
collision, and step cap.
…ew cleanup Closes the cancellation contract gaps surfaced by the multi-agent review: user-clicked Stop now actually stops sandboxes and discards in-flight artifact streams, and sandbox runs stream stdout/stderr to the canvas live instead of dropping the whole 16 KB buffer at terminal time. Cancellation cascade (C1 + C2) - New by_threadId index on sandboxExecutions, listNonTerminalByThread query, cancelExecutionRecord mutation, and cancelExecutionsForThread internalAction. cancelGeneration schedules the action (not awaited; the Stop-ack response is owed to the user immediately). spawn.ts already returns status:'cancelled' on abort, so the SSE result event the in-flight action receives is unchanged. - discardActiveStreamsForThread deletes revision-0 placeholders (artifact_create mid-stream) and clears liveStreamMode on settled rows (rewrite/patch mid-stream). Closes the ~6 min ghost-tile window that cleanupStaleStreams cron was the only previous recourse for. Live stdout/stderr (C5) - New 'stdout'/'stderr' SSE event types in both wire.ts files with a compile-time Equal<> parity guard. spawn.ts emits per-line stdout (PHASE markers stripped) and per-chunk stderr; spawn-util.ts grew the symmetric onStderrChunk hook. - Platform-side action coalesces deltas (250 ms / 2 KB threshold) through the new appendArtifactRunOutput mutation, which is terminal-state guarded, freshness-checked against runExecutionId, and capped at the existing 16 KB preview budget to match the canonical preview finalizeArtifactRun writes. - canvas-runnable-code-renderer's stdout/stderr <details> auto-opens while running/installing and auto-scrolls to bottom unless the user has scrolled away. New LiveTailDetails helper sets `open` via ref so the user can collapse it manually once a run finishes. OCC fix (C4) - expectedRevision dropped .optional() on every artifact_edit mode and the `?? artifact.revision` fallback was removed. The schema and tool description now require the LLM to pass the revision visible in the <artifact revision="N"> system context; OCC is real instead of trivially passing. CI unblock (C3) - Six sandbox lint errors corrected (no-unsafe-type-assertion + no-unnecessary-type-assertion in cleanup.ts / validate-request.ts). - artifacts/internal_mutations.test.ts rewritten: nine tests against the current title-idempotent createArtifact signature plus the new discardActiveStreamsForThread cascade. The pre-511e6b361 test suite is gone — that gate had been silently red ever since the streaming- create split moved. Production-compose port (C6) - 127.0.0.1:8003:8003 publish moved out of createSandboxService into generate-dev-compose only. Stateful prod no longer exposes the spawner on host loopback; Convex reaches it through the internal Docker network. A11y (C7) - segmented-radio buttons grew focus-visible ring tokens (fg-base / bg-elevated) — keyboard nav on /pricing and /hardware-pricing is no longer reliant on suppressible UA defaults. Dead-code cleanups - D1 deleted unwired updateStreamingContent mutation along with its orphan scheduleStreamingFlush/drainFlush helpers and the pendingFlush/flushInFlight state in stream_state.ts. - D2 removed the void mirrorLegacyContent tree-shake-shield in artifact_read_tool; the function has 16 real call sites in mutations.ts / internal_mutations.ts / snapshot_for_branch.ts. - D3 installOptions sandbox schema field marked @deprecated post-R2-B4 (kept per feedback_deprecate_dont_delete_schema_fields). - D4 dropped export keyword on three sandbox-internal-only symbols (NONCE_TTL_MS, releaseSpawnerLock, ValidateResult) so knip stays clean. - D5 deleted the unreachable "Only generate a PDF if the user explicitly insists" sentence from chat-agent.json (de/en/fr). - D6 documented the 'running' status literal asymmetry in convex/sandbox/wire.ts (kept for legacy rows; new audit-row writes emit 'installing' only). Bonus - snapshot_for_branch now branches at the in-scope revision's content instead of the source row's current state. create_branch_thread walks artifactRevisions, threading revisionFiles / revisionEntryFile / revisionContent through to the snapshot function; legacy content-only rows synthesize a single-file artifact. Closes the pre-existing create_branch_thread_artifacts test failure that was keeping bun run check red. Verified: bun run check green end-to-end — 44/44 turbo tasks, 70 684 tests. Sandbox suite 67 pass. The rewritten internal_mutations.test.ts adds 9. The branch-artifact suite (5) now passes the "later edits out of scope" case it was already asserting.
Symptom: chat UI froze while streamDeltas kept growing. Root cause is
the agent SDK rebuilding the full UIMessage from cursor=0 on every
Convex push, which becomes O(N²) when a large artifact_create emits
hundreds of uncompressed tool-input-delta rows.
UX
- Drop v{revision} badges from artifact bar and message bubbles; show
file count instead. Remove listRevisions query (no UI consumer).
- artifact_create soft-conflicts on a second call in the same
assistant message (conflict: 'already_created_in_message'), steering
the model toward artifact_edit. Tool description leads with DEFAULT
TO ONE ARTIFACT PER REPLY. New compound index
by_organizationId_thread_createdByMessageId backs the lookup; guard
gates on non-empty messageId to avoid cross-matching multi-step or
sub-agent edge cases that fall back to "".
- Canvas file sidebar gains a "+" button that reuses the existing
userEdit mutation, so users can add files without going through the
LLM. Sidebar now renders for any file count (not just > 1).
Streaming
- saveStreamDeltas.throttleMs 100 -> 250 (SDK default). Tale has no
inter-push smoothing layer, so 500ms felt chunky on fast streams;
250ms cuts stream row volume ~2.5x without visible regression.
- patch-package extends the agent SDK's compressUIMessageChunks with a
tool-input-delta merge branch (mirrors the text-delta merge). Submit
upstream and drop the patch on the next SDK bump.
Verification
- 70,688 unit tests pass; new internal_queries.test.ts covers the
guard query (hit, miss, empty-messageId short-circuit, scoping).
- stream_throttle.test.ts updated to assert the new value plus a
documented [100, 400] band.
Symptom: during a long artifact_create stream the canvas would visibly fill with content, then flash back to blank — and on LLM retry / "create in segments" the model started over from zero. Root cause: content tokens were only accumulated in client-side state and persisted to the DB once at execute time. Any interruption between begin and finalize lost everything that had been streamed. artifact_create - onInputDelta is now two-phase: Phase 1 commits the placeholder once type+title are known (unchanged); Phase 2 keeps parsing on every subsequent delta and flushes the parsed partial `content` into the row's `streamingContent` via a new internal mutation, throttled by the existing `shouldFlush` / `markFlushed` primitives from `stream_state.ts`. - The execute-time catch no longer unconditionally discards the placeholder. If it has already received some `streamingContent`, it stays — a later `artifact_create` with the same title takes the collision path and surfaces the partial state to the model rather than restarting from zero. Truly abandoned placeholders still get swept by the existing `by_liveStreamMode` cleanup cron. artifact_edit (rewrite mode) - Symmetric Phase 2 after beginEditStream: every parse pass with a growing `content` field flushes to `streamingContent`, gated on toolCallId + streamingPath matching the row's current session. The canvas's `hasDeltas ? streamedContent : (streamingContent ?? settled)` precedence already preferred `streamingContent` when the client-side tool-input-delta hook was empty — it just never had any bytes to show. With the server now writing during streaming, the fallback works as designed and the blank window disappears. Verification: 70,698 unit tests pass; new mutation tests cover the happy path + every guard (missing row, toolCallId mismatch, wrong mode, streamingPath mismatch); new wiring test confirms onInputDelta calls beginCreateStream then updateCreateStreamingContent in the same parse pass for fully-formed JSON, and that small subsequent growth is throttled.
…collapsible Previously the execution panel lived inside CanvasRunnableCodeRenderer and was keyed by the sidebar's `activePath` via getLatestRunPerFile. Switching to a sibling file (e.g. validate.py from generate.py) made the entry file's run output, including the .pptx download chip, disappear until the user clicked back. Hoist the panel to canvas-pane as a project-level fixture: - New `listRunsPerFile(artifactId)` query returns one projection per declared file with an execution row, ordered entry-file first, then declared `files[]` order. Older runs per path are collapsed; rows whose path is no longer in `files[]` (file deleted via canvas) are dropped. Legacy single-file artifacts with no per-file rows fall back to the artifact row's `run*` fields. Algorithm split into a pure `selectRunsPerFile` helper for unit testing. - New `RunResultPanel` subscribes once at the canvas level and renders the entry file's result inline (same chrome the legacy renderer used: status badge, error block, output files, stdout / stderr LiveTailDetails). Non-entry files' results render inside the existing `CollapsibleDetails` primitive — summary shows path + status badge, body expands the same chrome. Panel returns null when nothing's worth showing, matching legacy "stay quiet" UX. - `StatusBadge` / `FileChip` / `LiveTailDetails` and the run-detail body extracted into `run-result-helpers.tsx` so the new panel and any future consumer share one rendering source. The runnable code renderer is now a thin source-only wrapper around CanvasCodeRenderer. - `getLatestRunPerFile` is removed (sole caller was the panel that no longer needs it). Convex regenerates `_generated/api.d.ts` via `typeof` module reference. - i18n: three new `canvas.*` keys (en / de / fr): `runResultEntryLabel`, `runResultSecondaryLabel`, `runResultSecondaryCount` (ICU plural). - Sidebar selection now only swaps the source code below; the run panel above is stable and independent. Verification: 70,705 unit tests pass; new `queries.test.ts` covers the pure helper across 7 cases (entry-first ordering, repeat collapsing, deleted-file filtering, runProgress/runRevision mirroring only onto the active row, legacy fallback, empty state, no-path rows).
…/output/
Reproduces the user-reported failure: a multi-file python_runnable with
generate.py (writes /workspace/output/foo.pptx) and validate.py (reads
it back through markitdown) — when the LLM made two separate
artifact_run calls instead of using steps:[...], validate.py died on
FileNotFoundError because every artifact_run gets a fresh Docker
container with an empty /workspace/output/. The output file lived on
in artifact.runOutputFiles[] (in _storage) but was never re-staged.
Make separate calls forgiving:
- Wire: ExecuteRequest gains optional `priorOutputFiles: Array<{name,
contentBase64}>` (same envelope shape as harvest returns).
- Spawner: new `stagePriorOutputFiles(outputDir, files)` writes each
base64 entry into /workspace/output/ before the container starts.
Path-traversal guard (resolve + outputDir-prefix check) skips unsafe
names without aborting the run; bad entries are logged. Exported for
unit testing.
- Convex `executeCode` action: when artifactId is set, looks up the
artifact, sums runOutputFiles sizes, and forwards them to the spawner
if total ≤ 10 MiB. Over the cap → skip the pre-stage and inject a
one-line note via the existing stderr tail channel so users see why.
Load failures are logged + non-fatal — pre-staging is a best-effort
backstop, not a contract.
- artifact_run tool description: WORKSPACE LIFECYCLE section now says
recent outputs ARE pre-staged (with the cap), still nudges toward
steps:[...] as the canonical idiom for tightly-coupled chains.
steps:[...] remains the right pattern (one container, atomic outcome,
fail-fast). Pre-staging is the safety net for when the LLM forgets.
Verification: 70,705 platform tests pass, 7 new spawner-side bun:test
cases cover the path-traversal guard (../escape, /abs-escape), nested
paths, binary round-trip via base64, and empty-list no-op. The Convex
action's branch is integration-tested via the manual smoke described in
the plan (run generate.py separately, then validate.py — second run
sees the .pptx).
…e error
Symptom: streamDeltas keep growing while the chat UI looks frozen and the
Convex logs flood with `beginEditStream {code: 'streaming_in_progress'}`
on the same artifact, 30+ entries at the same second.
Root cause was in `d3ef05680` ("persist artifact streaming content
incrementally"): on an `artifact_create` execute error with non-empty
`streamingContent`, we skipped `discardCreateStream` to preserve the
partial content. The intent was that a follow-up same-title create would
recover via the collision path. But the row was left with revision=0
AND liveStreamMode='create' AND toolCallId set — any subsequent
`artifact_edit` then hit `beginEditStream`'s streaming_in_progress
refusal until the cleanup cron's stale threshold elapsed (minutes).
`artifact_edit_tool.ts`'s `onInputDelta` caught the rejection and
returned with `state.rowInitialized = false`, so every ~40 ms parse
pass retried the same mutation — hence the log flood and the
appearance of a frozen UI (Phase 2 flush never fired either).
Fix at the root, plus a defensive shield on the edit side.
§1 Settle, don't strand:
- New internal mutation `settleStrandedCreateStream(artifactId, toolCallId)`:
* revision=0 + create-mode + non-empty streamingContent → promote
to revision-1 (files=[{entryFile, content: streamingContent}],
clear streaming flags, insert artifactRevisions row with editKind
'create').
* revision=0 + empty buffer → delete the row (mirror of
discardCreateStream's old behavior).
* Already-settled or different mode → clear streaming flags only.
* toolCallId mismatch → no-op (don't touch another stream's row).
- Replace the lookup→if-content→skip-discard dance in artifact_create's
execute catch with a single call to the new mutation. The decision
lives server-side now; no client-side race window between read and
conditional patch.
§2 Break the edit-side retry loop:
- New optional `beginEditStreamFailed` flag on `ArtifactStreamState`.
- `artifact_edit_tool.onInputDelta`: short-circuit at the top of Phase 1
when the flag is set; stamp it in the `beginEditStream` catch.
- `artifact_edit_tool.execute`: read the flag and return a structured
`{success: false, code: 'streaming_in_progress', message}` so the LLM
gets the right recovery hint instead of falling through to the OCC
path.
Verification: 70,712 platform tests pass. New `settleStrandedCreateStream`
cases cover promote / delete / mismatch / settled-row / missing-row.
New `artifact_edit_tool.test.ts` confirms a permanent beginEditStream
rejection calls the mutation exactly once across four onInputDelta
invocations (and that the happy path still flushes via
`updateRewriteStreamingContent`).
Out of scope (separate follow-ups):
- `cleanupStaleStreams` cron behavior — kept as-is (delete revision-0
past threshold); §1 prevents stranding in the first place, the cron
is the long-tail safety net.
- LLM-side system prompt guidance to react to `streaming_in_progress`.
…ked-write 'append' mode
User has been hitting `streaming_in_progress` errors on `beginEditStream`
repeatedly despite multiple incremental fixes. The root pattern: the
`artifact_create` streaming-placeholder window is fundamentally fragile
— whenever the create's execute doesn't run cleanly (action mid-flight,
AI SDK retries, network blip, atypical step transitions), the artifact
stays in `liveStreamMode='create'` and blocks any subsequent
`artifact_edit` until the cleanup cron sweeps it minutes later.
Retire the streaming-create machinery entirely and route all content
delivery through `artifact_edit`. Add a new `append` mode so the LLM
has a proper chunked-write primitive for long files instead of being
forced to emit the whole content in a single huge tool input
(re-creating the fragility on the edit side).
artifact_create
* Drops the `content` argument entirely.
* No `onInputStart` / `onInputDelta` hooks. The tool is purely
synchronous metadata: type, title, language, entryFile, packages.
* `execute` keeps the same-message guard, then calls `createArtifact`
once and returns its result. The row always lands directly at
revision 1 with an empty entry file.
* Success-response `message` includes a concrete copy-pasteable
next-step hint pointing the LLM at `artifact_edit({mode:'append'})`.
* Description rewritten: removes all "content REQUIRED" language;
surfaces the new create-then-append/rewrite workflow.
artifact_edit — 5 modes total (3 content + 2 file-tree):
* `append` (new) — concat `content` to the end of a file at `path`;
creates the file if missing. Each call bumps revision; the LLM
delivers a long file via N small calls. Preferred over `rewrite`
for files >~10 KB or multi-turn delivery. OCC via `expectedRevision`
handles retry de-duplication (second land returns `code:'stale'`).
* `set_entry` retired. Common case (re-point entry on rename) is
already handled atomically by `rename`'s `from === entryFile`
follow-along; the rare swap-between-existing-files corner is doable
with a two-step rename. Saves the LLM one mode of cognitive load.
* `rewrite` / `patch` / `delete` / `rename` unchanged.
Backing mutations
* New `appendToFile` — mirror of `rewriteArtifact` but the file's new
content is `existing + args.content` instead of replacement. Audit
row uses `editKind: 'append'` (added to the schema validator).
* `beginEditStream` now seeds `streamingContent=''` for both
`rewrite` and `append` so the canvas's `streamingContent ?? settled`
fallback chain works identically across both content modes.
* `updateRewriteStreamingContent` accepts both `rewrite` and `append`
live modes (same wire shape, same per-tool-call live preview path).
* Deleted: `beginCreateStream`, `finalizeCreateStream`,
`updateCreateStreamingContent`, `discardCreateStream`,
`settleStrandedCreateStream`, `setArtifactEntry` (~380 LoC).
* `set_entry` literal kept in `artifactEditKindValidator` for
read-validator compatibility with existing rows (per
feedback_deprecate_dont_delete_schema_fields).
Schema:
* `artifactEditKindValidator` gains `'append'`.
* `liveStreamModeValidator` gains `'append'` (same wire shape as
`rewrite` — content streams in via tool-input).
Tests:
* Deleted `artifact_create_tool.test.ts` (the streaming hooks it
covered are gone).
* Trimmed `internal_mutations.test.ts`: removed
`updateCreateStreamingContent` and `settleStrandedCreateStream`
describe blocks; extended the mock's query builder with `.order()`
so trimRevisionHistory works inside the test.
* Added 5 `appendToFile` cases: concat happy path, create-if-missing,
stale OCC rejection, not_found, and a multi-call sequential flow
asserting the final revision and concatenated content.
Net: ~700 LoC deletion, the recurring streaming-create bug class
disappears, the LLM gets a natural way to handle long files. 70,705
platform tests pass; 0 lint warnings.
Out of scope (separate follow-ups):
- Streaming-flag schema field cleanup (still in use by rewrite/append/patch).
- `cleanupStaleStreams` cron (long-tail safety net; behaviour unchanged).
- LLM-side system prompt nudges to prefer `append` over `rewrite`.
…dit 'append' mode Follow-up to c5350c8 ("retire streaming-create + add 'append' mode"). The Convex layer wires `liveStreamMode='append'` correctly, but canvas-pane.tsx still gates its content-streaming logic on `liveStreamMode === 'create' || === 'rewrite'` only — `'append'` was missed, so an in-flight append left the canvas main area blank even though the file sidebar's streaming dot lit up (the sidebar uses a mode-agnostic `path === streamingPath` check). Add `liveStreamMode === 'append'` to both content-streaming disjunctions: * `isStreamingActiveFile` — drives the 3-tier fallback chain (live tool-input-delta → `streamingContent` → `settledContent`) and the streaming caret. * `showStreamingSource` — the source-view gate that keeps the pre-render code view on while a content stream is in flight. `useStreamedArtifactContent`'s tool-input-delta decoder is shape- identical for `rewrite` and `append` (both ship `content` at top-level of the parsed tool input), so the existing client-side path just works once these two gates allow it through. No other frontend file needs touching — grep confirmed every other reference to `liveStreamMode` is either mode-agnostic (`!== undefined`), explicitly `'patch'`, or a pure path comparison. Verification: 70,705 tests still pass. Manual smoke for the user's case (WiSeKey-style large pptx via append-mode chunked writes) should now show content streaming in token-by-token as each tool-input delta arrives — same UX as the historical rewrite mode.
…artifact runs A failed or cancelled run was unconditionally writing `runOutputFiles: []` into the artifact row (both at `initArtifactRun` and inside the empty `applyFinalizeArtifactRun` patch). Because pre-staging for the next run reads the artifact row's `runOutputFiles`, any failed intermediate run wiped the prior successful run's outputs — so a subsequent `artifact_run` that depended on those files hit a `FileNotFoundError` for a file that demonstrably existed. `applyFinalizeArtifactRun` now only writes `runOutputFiles` when the run completed OR the harvest produced at least one file; `initArtifactRun` no longer clears the field on start. Successful runs still replace atomically, so the only behavior change is "failed/cancelled run no longer destroys history."
…teral closes
`artifact_edit`'s `onInputDelta` was calling `beginEditStream` as soon as
`parsePartialJson` surfaced any `path` value — but a partial parse on
`{"path":"c` would auto-repair the string and surface `path: "c"`. Every
subsequent delta committed a new intermediate name ("cr", "cre", ...),
producing the visible filename flicker in the Canvas FILES panel.
Added `isPathFieldClosed` to scan the raw accumulator for the value's
unescaped closing `"` before allowing `beginEditStream`. Once the path
literal has closed it cannot regress (JSON is written linearly), so the
gate is one-way: the first committed `streamingPath` is the final one.
The Phase-2 content-flush hook is already transitively gated by
`state.rowInitialized`, so it inherits the fix without further change.
…pers + handler modules
internal_mutations.ts had grown to 1307 lines mixing internalMutation
wrappers, args/returns validators, and full handler bodies — hard to scan,
hard to extend, and the per-mutation contract was hidden inside ~150-line
handler bodies.
Split into four concern-scoped modules under handlers/:
- shared.ts: helpers, size guards, validateFiles,
clearStreamingFlags, trimRevisionHistory
- content_edits.ts: createArtifact, applyToolPatch, rewriteArtifact,
appendToFile, deleteFileFromArtifact,
renameFileInArtifact
- streaming.ts: beginEditStream, abortStream,
updateRewriteStreamingContent,
discardActiveStreamsForThread, cleanupStaleStreams
- run_state.ts: setArtifactRunConfig, initArtifactRun,
appendArtifactRunOutput, patchArtifactRunProgress,
finalizeArtifactRun (+ the pure
applyFinalizeArtifactRun helper)
Each handler module exports a `xxxArgs` / `xxxReturns` / `xxxHandler`
triple; the wrapper file now just imports and assembles them via
`internalMutation(...)`. Existing call sites are unaffected — the
internalMutation export names + shapes are preserved, and the two
cross-module helpers (`MAX_ARTIFACT_BYTES`, `assertAggregateSize`,
`applyFinalizeArtifactRun`) are re-exported for back-compat. The vitest
suite's `internalMutation: (config) => config` mock continues to work
unchanged because `.handler` still resolves to the registered function.
internal_mutations.ts: 1307 -> 197 lines.
…ables + backfill
Phase 1 of the plan in llm-majestic-hamming.md. Pure additive — the new
tables ship alongside the existing `artifacts.files[]` /
`artifacts.runOutputFiles[]` fields, which remain in the schema as
`@deprecated` per [feedback_deprecate_dont_delete_schema_fields] so
historical rows continue to pass the read validator unchanged. No read or
write paths are switched in this commit; later phases will migrate
callers table-by-table while the deprecated fields keep working as a
fallback.
New tables (all camelCase, registered in convex/schema.ts):
- artifactFiles one row per source file, keyed by (artifactId, path);
carries optional streamingWriteToolCallId pointer
the canvas will use to find streamDeltas for live
content rendering once the new write paths land
- artifactRuns one row per execution attempt, append-only; carries
optional inputsFromRun for cross-run pre-staging
- artifactRunFiles append-only output files per run; failed runs that
harvested partial files will still land rows here
(resolves D5 in the plan once the spawner is updated)
Backfill: migrations/backfill_artifact_files_table.ts (single paginated
internal mutation, follows the existing migrations/* convention). Reads
each artifact's `files[]` into `artifactFiles` and — for artifacts with
terminal `runStatus` — synthesizes one `artifactRuns` row plus a
`artifactRunFiles` row per output. Idempotent (every step probes the
target index first), safe to re-run.
Deprecation pass on artifactsTable: marked `files`, `runOutputFiles`,
`streamingPath`, `streamingContent`, `liveStreamMode`, `toolCallId`,
`liveStreamStartedAt`, `streamingPatches` with @deprecated JSDoc explaining
the replacement and the migration window. Fields stay on the row; new
code reads/writes the new tables once subsequent phases switch over.
…stage migrate to new tables Phase 2 of the plan in llm-majestic-hamming.md. Three concerns wired together end-to-end while keeping the deprecated artifacts.runOutputFiles field as a read fallback (per [feedback_deprecate_dont_delete_schema_fields]). 1. spawner harvest on failure (D5): `spawn.ts` now harvests `/workspace/output/` on every exit path — completed, cancelled, and failed — wrapped in try/catch so a stat error never trumps the underlying failure signal. Partial files the user script managed to write before crashing now reach the platform and land in artifactRunFiles (status=failed), so the user can see what was produced before the crash instead of getting nothing. 2. finalize dual-write: `applyFinalizeArtifactRun` still patches the artifact row's run-state fields (including artifacts.runOutputFiles when the harvest had output, per the Fix-1 guard), AND now also inserts one artifactRuns row + one artifactRunFiles row per output. Append-only — failed and cancelled runs leave their row in place for future history queries. The artifact row's run-state fields remain the canonical source for in-flight (queued/installing/running) state. 3. pre-stage read path: New getLatestRunOutputs internal query reads the most recent completed artifactRuns row + its artifactRunFiles, falling back to the legacy artifacts.runOutputFiles field for artifacts whose data hasn't been backfilled. executeCode in node_only/sandbox/ internal_actions.ts now calls it instead of reading runOutputFiles directly — the migration window keeps both paths working. Behaviour change visible to users today: a failed run that was previously preceded by a successful one no longer destroys the prior outputs at pre-stage time (Fix-1's guard already prevented destruction on the artifact row; this change ensures the new artifactRuns table preserves them across the cutover), and any partial files written before a crash now show up in the per-run artifactRunFiles set.
…ackages_add Two small surface additions to existing tools. Originally the plan also split artifact_edit's rewrite/append modes into separate _open / _write tools (Q4 of the original plan), but D2's flicker had already been solved structurally by the isPathFieldClosed gate (commit 94a9249), so the split would have doubled LLM round-trips without any user-visible gain. Reverted before commit. artifact_run: - New optional `inputs?: { from_run: "latest" | "<runId>" }` arg. Default behaviour unchanged ("latest succeeded run"); explicit runId pins pre-staging to that exact prior run regardless of status. - Threaded into executeCode → getLatestRunOutputs query, which now accepts an optional fromRun; resolves the runId via `ctx.db.normalizeId` so a malformed id silently falls through to the default path instead of crashing the action. - Return shape carries the new `runId` (resolved via `getRunByExecutionId` index on artifactRuns.executionId) so the LLM can reference history. Best-effort: omitted if finalize never ran. - New `by_executionId` index on artifactRuns backs the lookup. artifact_edit: - `rewrite` and `append` modes now accept optional `packages_add`. On success, unions the listed names into the artifact's persistent `runPackages` (dedup, case-sensitive); no-op if all listed entries were already present. Existing entries are never removed. - New `addArtifactPackages` internal mutation handler + thin wrapper; `artifact_edit`'s execute() calls it after the underlying rewrite / append succeeds. Best-effort: a package-update failure does NOT flip the edit's success status. Surface unchanged for non-runnable artifacts; `packages_add` is silently ignored when the artifact type is not python_runnable / node_runnable.
…ct_edit/artifact_read
LLM-facing tool surface migrates from modal artifact_edit (5 modes:
rewrite/append/patch/delete/rename) to single-responsibility CRUD:
- file_create / file_update / file_delete / file_rename — file-level writes
with explicit path_exists / file_missing guards and OCC via expectedRevision
- file_read / file_list — explicit paths required; no "smart inline" aggregate
- artifact_packages_add — runtime-dependency mutation separated from file ops
Cuts the "single file too big" anti-pattern by guiding the LLM (via the
shared artifacts system prompt) to split into multiple files instead of
chunked appends.
Schema / handler changes:
- Per-file dual-write into artifactFiles table via new syncArtifactFiles
helper, called from every settle path
- New createFileInArtifact / updateFileInArtifact handlers replace the
upsert rewriteArtifact; old applyToolPatch / rewriteArtifact / appendToFile
handlers + apply_patches.ts deleted
- Canvas getById switched to loadArtifactWithFiles, reading artifactFiles
as the authoritative source with doc fallback for legacy rows
Streaming fix:
- beginEditStream's row-level single-writer mutex removed — concurrent
file_create / file_update calls to different paths no longer fail with
spurious streaming_in_progress. Same-path collisions are still caught by
OCC at settle time; canvas streaming UI shifts to last-writer-wins.
Cleanup:
- artifact_edit_tool / artifact_read_tool deleted + unregistered
- Schema validators for historical editKind values (patch / rewrite /
append / set_entry) preserved per feedback_deprecate_dont_delete_schema_fields
- All LLM-facing prose (tool descriptions, error messages, build_artifacts_context)
rewritten to advertise file_* CRUD
- docs/{en,de,fr}/platform/workspace/canvas.md updated
…te shared system prompt Two cleanups on top of the strict-CRUD surface from 2088df6: 1. **Namespace tools by artifact scope.** All six file-level CRUD tools (file_create / file_update / file_delete / file_rename / file_read / file_list) operate on files INSIDE an artifact, not standalone files. Renaming to artifact_file_* puts them in the same prefix as their sibling artifact_create / artifact_list / artifact_run / artifact_packages_add, makes the scope explicit, and frees the bare file_* namespace for any future non-artifact file primitives. 2. **Move file-convention guidance out of the shared system prompt.** The "MULTI-FILE PROJECTS: split logically separate concerns..." block in build_artifacts_context.ts was sent to every agent on every turn, regardless of whether the agent had write tools registered. That's noise for query/RAG-only agents. The same guidance lives in the artifact_file_create / artifact_file_update / artifact_create / artifact_run tool descriptions — which only ship when the relevant tool is in scope — so the shared block can drop the duplication. Schema validator entries 'file_create' / 'file_delete' / 'file_rename' on artifactEditKindValidator are unchanged — they are persisted in historical artifactRevisions rows and must continue to parse. docs/{en,de,fr}/platform/workspace/canvas.md updated to use the new tool names.
After 6a826e2 retired artifact_edit and renamed file_* tools to artifact_file_*, examples/agents/chat-agent.json was left referencing the removed artifact_edit tool in both its toolNames array AND in three locales of systemInstructions. The chat agent would have: - failed to load artifact_edit (tool-not-found warning, no working modification path) - lacked every artifact_file_* tool entirely - still steered the LLM toward artifact_edit in the PPTX recovery and HTML revision flows baked into its system prompt Updated: - toolNames: drop artifact_edit; add artifact_packages_add + 6 artifact_file_* tools - systemInstructions (en / de / fr): rewrite the PPTX 3-tool sequence as a 4-tool sequence (create → file_update → run → file_update on failure); rewrite the HTML path to use file_update; replace artifact_edit references in the shared guardrail with artifact_file_update / artifact_file_create
… artifactId literal
`parsePartialJson` auto-closes in-flight strings, so streaming deltas were
firing `getById` with truncated artifactIds, tripping `v.id("artifacts")`
and spamming WARN logs. Generalise the existing path-field-closed gate
into `isStringFieldClosed(accumulator, fieldName)` and apply it to both
the `artifactId` preflight and the existing `path` check.
Previously, editing source code after a run cleared the status badge, stdout, stderr, and error block — only output files survived. The panel looked empty even though we still had everything to show. Flip the freshness gate from "hide" to "annotate": `isStale` (replaces `isRunFresh`) now signals when `runRevision` diverges from `artifactRevision`, and `StatusBadge` renders a secondary "Source edited" chip next to terminal-state badges. In-flight runs keep their spinner unadorned — the user choice is to treat a running execution as still legitimately progressing, not stale. Adds `canvas.runStale` to en/de/fr (de-CH inherits from de).
The sandbox runtime image hardcoded `exec python3 /workspace/code/main.py`,
forcing the spawner to write either user content or its multi-step wrapper
to that fixed path. Users couldn't name any artifact file `main.py` even
though that's the most natural Python entry-script name — the LLM's
`steps: [{path:"main.py"}, {path:"test.py"}]` flow was outright rejected.
The constraint protected the spawner's staging convention, not any real
semantic mistake. It's gone now:
- Runtime entrypoint takes the script path as a 4th positional arg and
exec()s it; user files run at their declared paths so tracebacks /
`__file__` carry the real filename, no synthetic mirror.
- Multi-step wrapper moves to `/workspace/.tale/runner.{py,js}` — a dir
whose dotfile segment is already unreachable from user paths, so there
is no collision surface and no user-facing reservation.
- `code` field removed from the spawner wire (`ExecuteRequest`,
`executeCode` action, `SpawnerExecuteBody`); `entryPath` is required
when `steps` is absent. Validator + action mutex updated to match.
Tests pin the user-reported workflow (`steps: ['main.py', 'test.py']`)
as a regression gate and assert the new staging layout (no synthetic
`main.py`, wrapper under `/workspace/.tale/`).
- Use head/tr for byte-cap test payloads instead of a 5 MiB bash brace-expansion loop that intermittently timed out on shared CI. - Remove unused SandboxSseEvent export from wire.ts. - Drop stale convex/_generated/** knip ignore (no longer needed).
LLM calls to artifact_file_update for ~22KB content hit the 8192-token output cap mid-string, producing an unrecoverable JSON parse error before the handler ran. Two-layer fix: raise the agent fallback cap to 32768 (was 8192) to give realistic headroom for tool-call arguments, and rewrite the file-write tool descriptions with a HARD size limit (~12KB / ~400 lines) plus explicit split-the-file workflow (e.g. slide1.js + slide2.js requiring into main.js) so the model decides to split BEFORE generating the call instead of getting truncated.
After raising the agent fallback maxOutputTokens to 32768, the ~12KB guidance was using only ~10% of the actual output budget and would prompt unnecessary file splitting for clearly safe writes. 40KB / 1000 lines aligns with the common best-practice line ceiling and still leaves ~65% headroom of the 32K cap for preamble / thinking / wrapping.
The platform forwards prior-run output files (base64-encoded, up to 10 MB raw / ~13.5 MB after base64 inflation) inline in the request body so the spawner can pre-stage them for the next run. The previous 256 KB cap was sized for "agent-authored code + small input file set" and never accounted for binary artifact outputs (PPTX/PDF/images) getting fed back in, so re-running any artifact that had already produced a non-trivial output file failed with HTTP 413 "sandbox spawner refused payload — request body exceeds spawner cap". 20 MB aligns with the upstream priorOutputFiles cap (MAX_PRIOR_OUTPUT_BYTES = 10 MB) after base64 overhead plus room for source files + JSON wrapper. Still sits well below the spawner container's 512 MB mem_limit so a single oversized POST cannot OOM the process — the original DoS-guard intent is preserved.
…oads work `listRunsPerFile` was projecting `sandboxExecutions.outputFiles` straight to the canvas, but that field is the audit row and intentionally omits `storageId`. <FileChip>'s `useFileUrl(storageId)` therefore returned null, leaving `<a href="#">` and a non-functional download button. Look up `storageId` per file via `fileMetadata` after the pure projection runs. Keeps `selectRunsPerFile` synchronous (unit tests unchanged) and fixes existing rows without a backfill.
Summary
Adds an isolated code-execution sandbox and folds it into the artifact system as a new class of runnable artifacts (
python_runnable/node_runnable). The LLM can now author, edit, and execute Python or Node scripts inside the chat canvas, with generated files surfaced as downloadable chat-card parts.What's new
New services
services/sandbox/— Bun-based spawner that manages containerized executions (quota, slot reservation, watchdog, output streaming).services/sandbox-runtime/— hardened Docker image (Python + Node) used as the per-execution container.services/sandbox-egress/— tinyproxy-based egress proxy isolating sandbox network traffic.Convex backend
convex/sandbox/— execution schema (sandboxExecutions), internal mutations/queries, output ingestion, cron-driven cleanup, governance hooks.node_only/sandbox/— internal action layer + spawner client that talks to the sandbox service.Agent tools
python_runnable,node_runnable) integrated intoartifact_create/artifact_edit.artifact_runtool — explicit, LLM-driven execution trigger. Separating authoring from execution prevents the LLM from spamming duplicateartifact_createcalls on run failure.chat-agent.jsonupdated (en/de/fr) with the 4-step author → run → edit → re-run cycle.Canvas UI
canvas-runnable-code-renderer.tsx— execution panel (status, stderr preview, generated files) stacked above the source pane.canvas.runDonei18n key (en/de/fr).CLI
tale doctor— diagnostics for sandbox prerequisites (docker, network, images).tale start/tale deployintegrate the sandbox + sandbox-egress services; dev compose publishes port 8003 forbun dev.Pre-PR checklist
bun run check(format, lint, typecheck, all tests). — not yet run; please run before mergeservices/platform/messages/{en,de,fr}.json(newcanvas.runDonekey)./docs/{en,de,fr}/for every user-visible change. — pending: new agent tool, new artifact types,tale doctor, sandbox env vars (SANDBOX_URL,SANDBOX_TOKEN)bun run --filter @tale/docs lintandbun run --filter @tale/docs test. — N/A until docs are addedREADME.md,README.de.md,README.fr.md. — likely needed (new services in the stack)Test plan
bun run checkis green.tale doctorreports OK on a clean dev machine.tale startbrings up sandbox + sandbox-egress alongside the existing stack; ports/network as expected.artifact_create(python_runnable)→artifact_run→ canvas shows stdout/stderr + downloadable files; no duplicate v1 tabs in the artifact bar.artifact_edit→artifact_runcycle (not anotherartifact_create).runErrorCodeand a stderr preview.sandboxExecutionsaudit rows are written and cleaned up by the cron.Summary by CodeRabbit
New Features
doctorcommand and sandbox network setup helpers; env tooling emits sandbox tokenBug Fixes
Tests