[WIP] feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration) #779
Here are 4 issues my Cursor PR review spotted, with specific lines of code referenced and possible solutions explained:

Issue 1: LLM provider/model settings are not persisted to DynamoDB

The API accepts LLM fields and the classifier reads them from the Laboratory row, but create/update never write them. The update handler spreads ...existing (update-laboratory.lambda.ts): API keys go to SSM (lines 126–143), but provider/model do not. Create has the same gap: laboratoryService.add() omits all four LLM fields (create-laboratory.lambda.ts).

Impact: Saving LLM settings in the UI stores API keys in SSM but leaves provider/model unset on the Laboratory row, so LLM classification is skipped (lines 116–118 return null). Selecting "None" also cannot clear an existing provider, because those fields are never written on update.

Suggested fix: In both create-laboratory.lambda.ts and update-laboratory.lambda.ts, map HealthOmicsLlmProvider, HealthOmicsLlmModelId, SeqeraLlmProvider, and SeqeraLlmModelId from the request into the add/update payload. For disable, explicitly clear them when the UI sends undefined (e.g. use null sentinels or DynamoDB REMOVE for cleared fields, since removeUndefinedValues: true would otherwise preserve stale values via ...existing).

Issue 2: HealthOmics statusMessage is discarded before classification

GetRun exposes both failureReason and statusMessage, but they are merged into one field and only that is persisted. The snapshot collapses both into a single failureReason (process-update-laboratory-run.lambda.ts). When both exist (e.g. WORKFLOW_RUN_FAILED + task-level detail), ?? keeps only the machine code. Only that merged value is written to FailureReason (process-update-laboratory-run.lambda.ts).

Impact: For ambiguous HealthOmics codes, the human-readable message (task name, log link) is lost before the classifier runs.

Suggested fix: Add a FailureStatusMessage (or similar) field to LaboratoryRun. In getAWSHealthOmicsStatus, return failureReason and statusMessage separately. On FAILED transition, persist both. Keep FailureReason as the machine code only.

Issue 3: Classifier passes FailureReason as statusMessage

Even if issue 2 were fixed, the classifier currently duplicates the machine code into both LLM input fields (process-classify-laboratory-run-failure.lambda.ts).

Impact: The LLM receives the machine code twice and no human-readable context, which is the main reason to use an LLM for ambiguous HealthOmics failures.

Suggested fix: After issue 2, set statusMessage: run.FailureStatusMessage (or equivalent). Do not reuse FailureReason for both fields.

Issue 4: Seqera errorReport is never sent to the classifier

The backend status-check path only captures errorMessage; errorReport is never fetched or passed through. The Seqera snapshot only reads errorMessage (process-update-laboratory-run.lambda.ts). The classifier only passes errorMessage (via FailureReason); errorReport is never set (process-classify-laboratory-run-failure.lambda.ts). ClassificationInput supports errorReport, and the prompt uses it, but nothing in the pipeline populates it. The run detail page fetches errorReport client-side for display only ([labRunId].vue line 142); the async classifier never sees it.

Impact: Seqera LLM classification runs on errorMessage alone and misses the richer stack-trace context the PR describes.

Suggested fix: In getSeqeraCloudStatus, also return errorReport. Persist it on LaboratoryRun (new field, or pass through the SNS payload at classification time). In resolveClassification, set errorReport on ClassificationInput, truncated per MAX_FIELD_CHARS in classification-prompt.ts.
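One way to read the Issue 1 suggestion is a small helper that turns the four LLM fields into a DynamoDB update with both SET and REMOVE clauses. The sketch below is only an illustration under stated assumptions: `buildLlmUpdate` and its types are hypothetical names, not code from the PR, and it assumes the request uses `null` as the "clear this field" sentinel while `undefined` means "leave unchanged".

```typescript
// Hypothetical helper; only the four field names come from the PR.
type LlmFields = {
  HealthOmicsLlmProvider?: string | null;
  HealthOmicsLlmModelId?: string | null;
  SeqeraLlmProvider?: string | null;
  SeqeraLlmModelId?: string | null;
};

interface UpdateParts {
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string>;
}

const LLM_FIELD_NAMES = [
  'HealthOmicsLlmProvider',
  'HealthOmicsLlmModelId',
  'SeqeraLlmProvider',
  'SeqeraLlmModelId',
] as const;

// Assumed convention:
//   string    -> SET the attribute
//   null      -> REMOVE the attribute (e.g. the user selected "None")
//   undefined -> leave the stored value untouched
function buildLlmUpdate(request: LlmFields): UpdateParts {
  const sets: string[] = [];
  const removes: string[] = [];
  const names: Record<string, string> = {};
  const values: Record<string, string> = {};

  for (const field of LLM_FIELD_NAMES) {
    const value = request[field];
    if (value === undefined) continue;
    names[`#${field}`] = field;
    if (value === null) {
      removes.push(`#${field}`);
    } else {
      values[`:${field}`] = value;
      sets.push(`#${field} = :${field}`);
    }
  }

  const clauses: string[] = [];
  if (sets.length > 0) clauses.push(`SET ${sets.join(', ')}`);
  if (removes.length > 0) clauses.push(`REMOVE ${removes.join(', ')}`);

  // Callers should skip the update when UpdateExpression is empty and omit
  // ExpressionAttributeValues when it has no entries; DynamoDB rejects both.
  return {
    UpdateExpression: clauses.join(' '),
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

With a request of `{ HealthOmicsLlmProvider: 'bedrock', SeqeraLlmProvider: null }` (the provider value is a placeholder), the helper sets the first attribute, removes the second, and leaves both model ids alone. Building the expression explicitly avoids relying on the spread of the existing row, which is where the stale values described above come from.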
feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)
Type of Change*
Description
Previously a FAILED run showed only a generic status badge; users had to leave the app and inspect the AWS Console or Seqera Cloud to find out what went wrong and who should fix it. This change captures the platform-provided failure reason at status-check time and classifies it by owner, so the user knows whether the problem belongs to the lab, a bioinformatician, or AWS.
Three layers, shipped together:
1. Surface raw failures. process-update-laboratory-run now extracts failureReason / statusMessage from the HealthOmics GetRun response and workflow.errorMessage from Seqera, and persists them as FailureReason on the LaboratoryRun row when a run transitions to FAILED. The run detail page renders a red banner with the raw text; Seqera also surfaces errorReport in a collapsible section.

2. Classify by owner. New FailureOwner ('Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous'), FailureSummary, FailureAction, and FailureClassifiedBy ('lookup' | 'llm') fields on LaboratoryRun. A new async pipeline (SNS FIFO topic + SQS queue + process-classify-laboratory-run-failure Lambda) fires on FAILED transitions:
   - shared-lib/utils/failure-classifier.ts covers the 22 documented HealthOmics codes from the failure matrix.
   - Ambiguous codes (WORKFLOW_RUN_FAILED, generic RUN_TASK_FAILED) and all Seqera errors fall back to the LLM.

3. BYOK LLM per integration. Each lab brings its own provider, model id, and (for OpenAI / Anthropic) API key, separately for HealthOmics and Seqera, so a lab can run, e.g., Bedrock for HealthOmics ambiguous codes + OpenAI for high-volume Seqera traffic. API keys live in SSM SecureString at /easy-genomics/organization/{orgId}/laboratory/{labId}/llm-api-key-{healthomics|seqera}; the read-laboratory endpoint returns only HasHealthOmicsLlmApiKey / HasSeqeraLlmApiKey boolean indicators. Three provider implementations: BedrockClassificationProvider (uses Lambda IAM, no key), OpenAIClassificationProvider, AnthropicClassificationProvider. All providers share the same prompt template + JSON response parser.

UI surface:
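The "Classify by owner" layer described above (deterministic lookup first, LLM fallback second) can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `lookupClassification`, `classifyFailure`, `Verdict`, and `LlmClassifier` are hypothetical names, the table holds a single entry instead of the 22 documented codes, and the summary and action strings are invented for the example. Only the field names, the owner values, and the ECR_PERMISSION_ERROR to "Bioinformatician" mapping come from the description.

```typescript
type FailureOwner = 'Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous';

interface Verdict {
  FailureOwner: FailureOwner;
  FailureSummary: string;
  FailureAction: string;
}

interface Classification extends Verdict {
  FailureClassifiedBy: 'lookup' | 'llm';
}

// Hypothetical shape of a per-lab LLM provider call; returns null on failure.
type LlmClassifier = (failureReason: string) => Promise<Verdict | null>;

// Illustrative one-entry table; summary/action wording is made up.
const LOOKUP = new Map<string, Verdict>([
  [
    'ECR_PERMISSION_ERROR',
    {
      FailureOwner: 'Bioinformatician',
      FailureSummary: 'The workflow image in ECR could not be accessed.',
      FailureAction: 'Review the ECR repository permissions for the workflow image.',
    },
  ],
]);

// Deterministic path: known codes never reach the LLM.
function lookupClassification(code: string): Classification | null {
  const verdict = LOOKUP.get(code);
  return verdict ? { ...verdict, FailureClassifiedBy: 'lookup' } : null;
}

// Fallback path: unknown or ambiguous codes go to the LLM if one is configured.
async function classifyFailure(
  failureReason: string,
  llm?: LlmClassifier,
): Promise<Classification | null> {
  const known = lookupClassification(failureReason);
  if (known) return known;
  if (!llm) return null; // no provider configured for this lab: skip
  const verdict = await llm(failureReason);
  return verdict ? { ...verdict, FailureClassifiedBy: 'llm' } : null;
}
```

Keeping the lookup synchronous and separate from the async fallback means the zero-cost path can be unit-tested without any provider mock, and the `FailureClassifiedBy` value always records which path produced the result.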
Testing*
Unit tests — 64 new tests, all passing:
- failure-classifier.test.ts (shared-lib)
- bedrock-classification-provider.test.ts
- openai-classification-provider.test.ts: … response_format: json_object), success path, non-2xx fallback, network error fallback, malformed response fallback
- anthropic-classification-provider.test.ts
- llm-classification-service.test.ts
- process-classify-laboratory-run-failure.lambda.test.ts: … llm-api-key-healthomics vs llm-api-key-seqera), idempotency on FailureOwner-already-set, non-UPDATE op rejection, missing-provider skip
- process-update-laboratory-run.lambda.test.ts: failureReason capture from both platforms, persistence on FAILED transition, no-overwrite on subsequent checks, no-write on non-FAILED transitions

Visual previews: standalone HTML files (in ai-failure-previews.zip, not committed) covering every owner × source combination on the run detail card and every provider × key-state combination on the lab settings form.

Manual (post-deploy):
- … ECR_PERMISSION_ERROR). Verify within ~30s the run detail page shows the red banner + classification block with FailureClassifiedBy: lookup, owner "Bioinformatician".
- … WORKFLOW_RUN_FAILED. With Bedrock configured on the lab, verify the classification block shows FailureClassifiedBy: llm with the AI-assisted footnote.

Impact
New AWS resources (CDK):
- laboratory-run-failure-classification-topic (SNS FIFO)
- laboratory-run-failure-classification-queue (SQS FIFO)
- process-classify-laboratory-run-failure Lambda (60s timeout, 512 MB, batchSize 1)

New IAM for the classifier Lambda:
- dynamodb:Query on laboratory-table, dynamodb:Query/UpdateItem on laboratory-run-table
- bedrock:InvokeModel scoped to arn:aws:bedrock:{region}::foundation-model/* (wildcard because the model is per-lab)
- ssm:GetParameter on llm-api-key-healthomics and llm-api-key-seqera paths

New SSM paths for lab CRUD Lambdas (create / update / read / delete):
- /easy-genomics/organization/*/laboratory/*/llm-api-key-healthomics
- /easy-genomics/organization/*/laboratory/*/llm-api-key-seqera

New runtime dependency:
- @aws-sdk/client-bedrock-runtime@3.782.0 (pinned for Smithy version compatibility with the rest of the SDK clients in the package; declared in .projenrc.ts).

DynamoDB schema additions (all optional, no migration needed):
- LaboratoryRun: FailureReason, FailureOwner, FailureSummary, FailureAction, FailureClassifiedBy
- Laboratory: HealthOmicsLlmProvider, HealthOmicsLlmModelId, SeqeraLlmProvider, SeqeraLlmModelId (+ HasHealthOmicsLlmApiKey / HasSeqeraLlmApiKey on read responses only)

Cost: incremental only when a lab opts in to LLM classification. Bedrock InvokeModel is metered per-token by AWS; OpenAI / Anthropic usage is billed to the customer's own account. SSM SecureString cost is negligible (≤ $0.05 per 10k decrypts). The deterministic lookup path has zero LLM cost.
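The optional attributes listed under the DynamoDB schema additions could be typed along these lines. This is a sketch under assumptions rather than the PR's actual type definitions: the interface names, the `Status` field, and the plain `string` types for free-text and provider fields are guesses for illustration; the attribute names and the two union types are taken from the description.

```typescript
// Every new attribute is optional, so rows written before this change
// remain valid without a migration.
interface LaboratoryRunFailureFields {
  FailureReason?: string;
  FailureOwner?: 'Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous';
  FailureSummary?: string;
  FailureAction?: string;
  FailureClassifiedBy?: 'lookup' | 'llm';
}

interface LaboratoryLlmFields {
  HealthOmicsLlmProvider?: string; // assumed to be a plain string identifier
  HealthOmicsLlmModelId?: string;
  SeqeraLlmProvider?: string;
  SeqeraLlmModelId?: string;
}

// Assumed minimal run shape; `Status` is a hypothetical field name.
interface LaboratoryRunRow extends LaboratoryRunFailureFields {
  Status: string;
}

// A run needs classification only if it failed and has no owner yet,
// mirroring the "idempotency on FailureOwner-already-set" test case.
function needsClassification(run: LaboratoryRunRow): boolean {
  return run.Status === 'FAILED' && run.FailureOwner === undefined;
}
```

Because absence of `FailureOwner` is the only signal the guard reads, a redelivered queue message for an already-classified run is a no-op.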
Performance: classification runs asynchronously after the FAILED transition and never blocks the status-check pipeline. The publish from process-update-laboratory-run is best-effort (it swallows errors so the status update completes even if the topic is misconfigured).

Additional Information
- … back-end.failure-classification block in easy-genomics.yaml; it has been removed in favor of fully per-lab BYOK. The only thing the operator controls is whether the Bedrock IAM is granted (already done by this PR's CDK changes).
- … run.Platform.
- … orgsStore.orgs[currentOrgId]?.NextFlowTowerApiBaseUrl now uses optional chaining. Without it, the Settings page rendered blank whenever orgsStore.orgs hadn't loaded by the time EGFormLabDetails mounted.
- … 3.782.0 (matched to the rest of the AWS SDK clients in the package) to avoid Smithy internal-type drift with the more recent default that pnpm would otherwise resolve.

Checklist*
- … *-service.ts pattern; Lambda is process-* per the routing convention.)
- … withNormalizedLlmFields() rationale on the front-end, idempotency guard in processClassificationEvent.)