[WIP] feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)#779

Open
diegomayorga-dept wants to merge 5 commits into development from feat/egv-180-show-omics-seqera-errors

@diegomayorga-dept
Contributor

feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)

Type of Change*

  • New feature
  • Bug fix
  • Documentation update
  • Refactoring
  • Hotfix
  • Security patch
  • UI/UX improvement

Description

Previously a FAILED run showed only a generic status badge — users had to leave the app and inspect AWS Console or Seqera Cloud to find out what went wrong and who should fix it. This change captures the platform-provided failure reason at status-check time and classifies it by owner so the user knows whether it's a lab, a bioinformatician, or AWS problem.

Three layers, shipped together:

  1. Surface raw failures. process-update-laboratory-run now extracts failureReason / statusMessage from the HealthOmics GetRun response and workflow.errorMessage from Seqera and persists them as FailureReason on the LaboratoryRun row when a run transitions to FAILED. The run detail page renders a red banner with the raw text; Seqera also surfaces errorReport in a collapsible.

  2. Classify by owner. New FailureOwner ('Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous'), FailureSummary, FailureAction, FailureClassifiedBy ('lookup' | 'llm') fields on LaboratoryRun. A new async pipeline (SNS FIFO topic + SQS queue + process-classify-laboratory-run-failure Lambda) fires on FAILED transitions:

    • Deterministic lookup in shared-lib/utils/failure-classifier.ts covers the 22 documented HealthOmics codes from the failure matrix.
    • Ambiguous HealthOmics codes (WORKFLOW_RUN_FAILED, generic RUN_TASK_FAILED) and all Seqera errors fall back to the LLM.

  3. BYOK LLM per integration. Each lab brings its own provider, model id, and (for OpenAI / Anthropic) API key, configured separately for HealthOmics and Seqera, so a lab can run, e.g., Bedrock for HealthOmics ambiguous codes and OpenAI for high-volume Seqera traffic. API keys live in SSM SecureString at /easy-genomics/organization/{orgId}/laboratory/{labId}/llm-api-key-{healthomics|seqera}; the read-laboratory endpoint returns only HasHealthOmicsLlmApiKey / HasSeqeraLlmApiKey boolean indicators. Three provider implementations: BedrockClassificationProvider (uses Lambda IAM, no key), OpenAIClassificationProvider, AnthropicClassificationProvider. All providers share the same prompt template and JSON response parser.
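
A minimal sketch of the lookup-then-LLM flow described in layer 2, not the code in this PR. The names `lookupClassification`, `classifyFailure`, `LlmClassifier`, and the platform strings are hypothetical. The single table entry uses the owner named in the manual test plan; its summary and action text are placeholders. Sending unknown HealthOmics codes to the LLM is also an assumption.

```typescript
type FailureOwner = 'Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous';
type Platform = 'healthomics' | 'seqera'; // hypothetical platform identifiers

interface Classification {
  owner: FailureOwner;
  summary: string;
  action: string;
  classifiedBy: 'lookup' | 'llm';
}

type ClassificationBody = Omit<Classification, 'classifiedBy'>;

// Codes the description says are routed to the LLM rather than the table.
const AMBIGUOUS_CODES = new Set<string>(['WORKFLOW_RUN_FAILED', 'RUN_TASK_FAILED']);

// Illustrative single entry; the real table is said to cover 22 codes.
// Summary and action strings are placeholders, not the PR's wording.
const LOOKUP = new Map<string, ClassificationBody>([
  [
    'ECR_PERMISSION_ERROR',
    {
      owner: 'Bioinformatician',
      summary: 'HealthOmics could not pull the container image.',
      action: 'Check the ECR repository permissions for the workflow image.',
    },
  ],
]);

// Deterministic path: returns null for ambiguous or unknown codes.
function lookupClassification(code: string): Classification | null {
  if (AMBIGUOUS_CODES.has(code)) return null;
  const hit = LOOKUP.get(code);
  return hit ? { ...hit, classifiedBy: 'lookup' as const } : null;
}

// Hypothetical abstraction over the Bedrock / OpenAI / Anthropic providers.
type LlmClassifier = (failureReason: string) => Promise<ClassificationBody | null>;

async function classifyFailure(
  platform: Platform,
  failureReason: string,
  llm?: LlmClassifier,
): Promise<Classification | null> {
  if (platform === 'healthomics') {
    const hit = lookupClassification(failureReason);
    if (hit) return hit;
  }
  // No provider configured ("None"): leave the run with the raw banner only.
  if (!llm) return null;
  const body = await llm(failureReason);
  return body ? { ...body, classifiedBy: 'llm' as const } : null;
}
```

Keeping the lookup synchronous and free of I/O means the deterministic path never incurs LLM cost, which matches the cost note later in this description.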

UI surface:

  • Lab settings → AI Failure Analysis card with HealthOmics + Seqera sub-sections, each with provider dropdown, model id, API key (with KEY SAVED badge when previously set), and a "None — disable AI analysis" reset option.
  • Run detail page → Failure analysis section above the existing failure banner, showing owner badge (color-coded: Lab amber / Bioinformatician red / AWS blue / Ambiguous gray), summary, "What to do next" action, and an "AI-assisted classification — verify before acting" footnote when the classification came from the LLM.

Testing*

Unit tests — 64 new tests, all passing:

| Suite | Coverage |
| --- | --- |
| failure-classifier.test.ts (shared-lib) | Full table coverage of the 22 deterministic codes, null returns for the two ambiguous codes routed to LLM |
| bedrock-classification-provider.test.ts | Response parser shape + truncation rules |
| openai-classification-provider.test.ts | HTTP request shape (Bearer auth, response_format: json_object), success path, non-2xx fallback, network error fallback, malformed response fallback |
| anthropic-classification-provider.test.ts | Same shape (x-api-key + anthropic-version headers, Messages API body) |
| llm-classification-service.test.ts | Per-provider dispatch + null returns when modelId / apiKey missing |
| process-classify-laboratory-run-failure.lambda.test.ts | Deterministic lookup, LLM fallback per platform, per-integration SSM key fetch (llm-api-key-healthomics vs llm-api-key-seqera), idempotency on FailureOwner-already-set, non-UPDATE op rejection, missing-provider skip |
| process-update-laboratory-run.lambda.test.ts | 8 new tests added: failureReason capture from both platforms, persistence on FAILED transition, no-overwrite on subsequent checks, no-write on non-FAILED transitions |
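
The "malformed response fallback" and "truncation rules" cases above suggest a shared parser that returns null instead of throwing. The following is a rough sketch under that assumption; the JSON field names (`owner`, `summary`, `action`), the function name, and the `MAX_TEXT_CHARS` limit are all hypothetical and may differ from the PR's parser.

```typescript
type FailureOwner = 'Lab' | 'Bioinformatician' | 'AWS' | 'Ambiguous';

interface ParsedClassification {
  owner: FailureOwner;
  summary: string;
  action: string;
}

const VALID_OWNERS = new Set<string>(['Lab', 'Bioinformatician', 'AWS', 'Ambiguous']);
const MAX_TEXT_CHARS = 500; // hypothetical truncation limit

// Returns null for anything that is not a well-formed classification object,
// so callers can fall back to showing only the raw failure banner.
function parseClassificationResponse(raw: string): ParsedClassification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const { owner, summary, action } = data as Record<string, unknown>;
  if (typeof owner !== 'string' || !VALID_OWNERS.has(owner)) return null;
  if (typeof summary !== 'string' || typeof action !== 'string') return null;

  return {
    owner: owner as FailureOwner,
    summary: summary.slice(0, MAX_TEXT_CHARS),
    action: action.slice(0, MAX_TEXT_CHARS),
  };
}
```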

Visual previews — standalone HTML files (in ai-failure-previews.zip, not committed) covering every owner × source combination on the run detail card and every provider × key-state combination on the lab settings form.

Manual (post-deploy):

  • Trigger a HealthOmics failure with a deterministic code (e.g. submit a workflow with a private ECR image → ECR_PERMISSION_ERROR). Verify within ~30s the run detail page shows the red banner + classification block with FailureClassifiedBy: lookup, owner "Bioinformatician".
  • Trigger a HealthOmics failure with WORKFLOW_RUN_FAILED. With Bedrock configured on the lab, verify the classification block shows FailureClassifiedBy: llm with the AI-assisted footnote.
  • Trigger a Seqera failure (e.g. malformed sample sheet). With OpenAI key configured on the lab, verify the classification fires.
  • Set the provider dropdown to "None" on a lab — verify subsequent failures skip the LLM (raw banner only).
  • Confirm DynamoDB rows for newly-failed runs have the new fields populated; older FAILED rows render only the raw banner (no migration needed since all fields are optional).

Impact

New AWS resources (CDK):

  • laboratory-run-failure-classification-topic (SNS FIFO)
  • laboratory-run-failure-classification-queue (SQS FIFO)
  • process-classify-laboratory-run-failure Lambda (60s timeout, 512 MB, batchSize 1)

New IAM for the classifier Lambda:

  • dynamodb:Query on laboratory-table, dynamodb:Query/UpdateItem on laboratory-run-table
  • bedrock:InvokeModel scoped to arn:aws:bedrock:{region}::foundation-model/* (wildcard because the model is per-lab)
  • ssm:GetParameter on llm-api-key-healthomics and llm-api-key-seqera paths

New SSM paths for lab CRUD Lambdas (create / update / read / delete):

  • /easy-genomics/organization/*/laboratory/*/llm-api-key-healthomics
  • /easy-genomics/organization/*/laboratory/*/llm-api-key-seqera
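
For illustration, the per-integration parameter name could be built as below. The path template is the one quoted in the description; the helper name `llmApiKeyParameterName` and the `LlmIntegration` type are hypothetical.

```typescript
type LlmIntegration = 'healthomics' | 'seqera';

// Builds the SSM SecureString parameter name for one lab and one integration.
function llmApiKeyParameterName(
  orgId: string,
  labId: string,
  integration: LlmIntegration,
): string {
  return `/easy-genomics/organization/${orgId}/laboratory/${labId}/llm-api-key-${integration}`;
}

// Example: the two keys for the same lab resolve to distinct parameters.
const healthOmicsKeyName = llmApiKeyParameterName('org-1', 'lab-1', 'healthomics');
const seqeraKeyName = llmApiKeyParameterName('org-1', 'lab-1', 'seqera');
```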

New runtime dependency: @aws-sdk/client-bedrock-runtime@3.782.0 (pinned for Smithy version compatibility with the rest of the SDK clients in the package; declared in .projenrc.ts).

DynamoDB schema additions (all optional, no migration needed):

  • LaboratoryRun: FailureReason, FailureOwner, FailureSummary, FailureAction, FailureClassifiedBy
  • Laboratory: HealthOmicsLlmProvider, HealthOmicsLlmModelId, SeqeraLlmProvider, SeqeraLlmModelId (+ HasHealthOmicsLlmApiKey / HasSeqeraLlmApiKey on read responses only)

Cost: incremental only when a lab opts in to LLM classification. Bedrock InvokeModel is metered per token by AWS; OpenAI / Anthropic usage is billed to the customer's own provider account. SSM SecureString cost is negligible (≤ $0.05 per 10k decrypts). Deterministic lookup path has zero LLM cost.

Performance: classification runs asynchronously after the FAILED transition; never blocks the status-check pipeline. The publish from process-update-laboratory-run is best-effort (swallows errors so the status update completes even if the topic is misconfigured).
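
A sketch of what "best-effort" publishing can look like, assuming a thin `Publisher` wrapper around the SNS client. The `Publisher` interface and `publishBestEffort` name are hypothetical; the PR's actual service wrapper is not shown here.

```typescript
// Hypothetical wrapper so the sketch does not depend on the AWS SDK.
interface Publisher {
  publish(message: string): Promise<void>;
}

// Publishes the classification request but never throws: a failed publish is
// logged and reported as `false` so the status update can still complete.
async function publishBestEffort(
  publisher: Publisher,
  payload: unknown,
  log: (message: string) => void = console.warn,
): Promise<boolean> {
  try {
    await publisher.publish(JSON.stringify(payload));
    return true;
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    log(`failure-classification publish skipped: ${detail}`);
    return false;
  }
}
```

The trade-off is that a misconfigured topic fails silently from the user's point of view, so the log line is the only signal that classification was skipped.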

Additional Information

  • No operator-level YAML config required. Earlier iterations had a back-end.failure-classification block in easy-genomics.yaml; it has been removed in favor of fully per-lab BYOK. The only thing the operator controls is whether the Bedrock IAM is granted (already done by this PR's CDK changes).
  • Per-integration providers are independent. A lab can run Bedrock for HealthOmics and OpenAI for Seqera, or any other combination — the classifier Lambda picks the right config based on run.Platform.
  • Setting the provider IS the on/off. There is no separate enable toggle — selecting "None — disable AI analysis" in the dropdown disables LLM classification for that integration cleanly.
  • Lab settings form has a pre-existing crash bug that was fixed inline (line 68): orgsStore.orgs[currentOrgId]?.NextFlowTowerApiBaseUrl now uses optional chaining. Without it, the Settings page rendered blank whenever orgsStore.orgs hadn't loaded by the time EGFormLabDetails mounted.
  • Bedrock SDK was pinned to 3.782.0 (matched to the rest of the AWS SDK clients in the package) to avoid Smithy internal-type drift with the more recent default that pnpm would otherwise resolve.
  • Labs that supply their own LLM API key are responsible for rotating it. The classifier reads the current value on every classification, so a rotation only requires re-saving the key on the lab settings page.
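
The optional-chaining fix mentioned in the list above can be illustrated as follows. This is a simplified sketch: the `Org` shape, the `towerBaseUrl` helper, and the empty-string fallback are hypothetical, and only the `orgs[currentOrgId]?.NextFlowTowerApiBaseUrl` expression comes from the PR.

```typescript
interface Org {
  NextFlowTowerApiBaseUrl?: string;
}

// Orgs may not be loaded yet when the form mounts, so a lookup can miss.
type OrgsById = Record<string, Org | undefined>;

function towerBaseUrl(orgs: OrgsById, currentOrgId: string, fallback = ''): string {
  // Without `?.`, this would throw a TypeError when the org entry is
  // undefined, which is the kind of failure that left the page blank.
  return orgs[currentOrgId]?.NextFlowTowerApiBaseUrl ?? fallback;
}
```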

Checklist*

  • No new errors or warnings have been introduced.
  • All tests pass successfully and new tests added as necessary.
  • Documentation has been updated accordingly. (Per-feature inline JSDoc on schemas, interfaces, and providers; YAML example block removed since it no longer applies.)
  • Code adheres to the coding and style guidelines of the project. (SSM-based per-lab secrets pattern mirrors the existing Seqera access token; service wrappers follow the existing *-service.ts pattern; Lambda is process-* per the routing convention.)
  • Code has been commented in particularly hard-to-understand areas. (Per-integration SSM path resolution in the classifier, withNormalizedLlmFields() rationale on the front-end, idempotency guard in processClassificationEvent.)

@diegomayorga-dept changed the title from "Feat/egv 180 show omics seqera errors" to "feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)" on Jun 1, 2026
@kevin-loftus-dept
Contributor

Here are 4 issues my Cursor PR review spotted, with specific lines of code referenced and possible solutions explained:

Issue 1: LLM provider/model settings are not persisted to DynamoDB

The API accepts LLM fields and the classifier reads them from the Laboratory row, but create/update never write them.

The update handler spreads ...existing and only maps a fixed set of fields — no LLM provider/model fields:

update-laboratory.lambda.ts
Lines 65-81

API keys go to SSM (lines 126–143), but provider/model do not.

Create has the same gap — laboratoryService.add() omits all four LLM fields:

create-laboratory.lambda.ts
Lines 67-81

Impact

Saving LLM settings in the UI stores API keys in SSM but leaves provider/model unset on the Laboratory row, so LLM classification is skipped (lines 116–118 return null). Selecting "None" also cannot clear an existing provider, because those fields are never written on update.

Suggested fix

In both create-laboratory.lambda.ts and update-laboratory.lambda.ts, map HealthOmicsLlmProvider, HealthOmicsLlmModelId, SeqeraLlmProvider, and SeqeraLlmModelId from the request into the add/update payload. For disable, explicitly clear them when the UI sends undefined (e.g. use null sentinels or DynamoDB REMOVE for cleared fields, since removeUndefinedValues: true would otherwise preserve stale values via ...existing).
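
One possible shape for the suggested fix, assuming the update path can issue a DynamoDB `UpdateExpression` through the document client and that the API uses `null` as the "clear this field" sentinel. The helper name `buildLlmFieldUpdate` and the sentinel convention are hypothetical, not taken from the PR.

```typescript
const LLM_FIELDS = [
  'HealthOmicsLlmProvider',
  'HealthOmicsLlmModelId',
  'SeqeraLlmProvider',
  'SeqeraLlmModelId',
] as const;

type LlmField = (typeof LLM_FIELDS)[number];

// undefined = leave untouched, null = clear, string = set.
type LlmFieldPatch = Partial<Record<LlmField, string | null>>;

interface UpdateParts {
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues?: Record<string, string>;
}

// Returns null when the patch touches none of the LLM fields.
function buildLlmFieldUpdate(patch: LlmFieldPatch): UpdateParts | null {
  const sets: string[] = [];
  const removes: string[] = [];
  const names: Record<string, string> = {};
  const values: Record<string, string> = {};

  for (const field of LLM_FIELDS) {
    const value = patch[field];
    if (value === undefined) continue;
    names[`#${field}`] = field;
    if (value === null) {
      removes.push(`#${field}`); // explicit clear, e.g. provider set to "None"
    } else {
      sets.push(`#${field} = :${field}`);
      values[`:${field}`] = value;
    }
  }

  if (sets.length === 0 && removes.length === 0) return null;

  const clauses: string[] = [];
  if (sets.length > 0) clauses.push(`SET ${sets.join(', ')}`);
  if (removes.length > 0) clauses.push(`REMOVE ${removes.join(', ')}`);

  return {
    UpdateExpression: clauses.join(' '),
    ExpressionAttributeNames: names,
    // DynamoDB rejects an empty ExpressionAttributeValues map, so omit it.
    ...(sets.length > 0 ? { ExpressionAttributeValues: values } : {}),
  };
}
```

Distinguishing `undefined` from `null` keeps partial updates safe: a request that omits the LLM fields leaves them alone, while an explicit `null` removes a stale provider.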

Issue 2: HealthOmics statusMessage is discarded before classification

GetRun exposes both failureReason and statusMessage, but they are merged into one field and only that is persisted.

Snapshot collapses both into a single failureReason:

process-update-laboratory-run.lambda.ts
Lines 277-281

When both exist (e.g. WORKFLOW_RUN_FAILED + task-level detail), ?? keeps only the machine code.

Only that merged value is written to FailureReason:

process-update-laboratory-run.lambda.ts
Lines 221-223

Impact

For ambiguous HealthOmics codes, the human-readable message (task name, log link) is lost before the classifier runs.

Suggested fix

Add a FailureStatusMessage (or similar) field to LaboratoryRun. In getAWSHealthOmicsStatus, return failureReason and statusMessage separately. On FAILED transition, persist both. Keep FailureReason as the machine code only.
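
To make the difference concrete, here is a small sketch contrasting the merged snapshot with the suggested split. The merged form is inferred from the description of `??` above rather than copied from the PR, and `FailureStatusMessage` plus both function names are hypothetical.

```typescript
// The two GetRun response fields this issue is about.
interface GetRunFailureFields {
  failureReason?: string;
  statusMessage?: string;
}

interface HealthOmicsFailureSnapshot {
  FailureReason?: string; // machine code only
  FailureStatusMessage?: string; // human-readable detail (proposed new field)
}

// Inferred current behaviour: statusMessage is used only when failureReason
// is absent, so the detail is dropped whenever both are present.
function mergedFailureReason(run: GetRunFailureFields): string | undefined {
  return run.failureReason ?? run.statusMessage;
}

// Suggested behaviour: keep both values so the classifier can see each.
function splitFailureSnapshot(run: GetRunFailureFields): HealthOmicsFailureSnapshot {
  return {
    FailureReason: run.failureReason,
    FailureStatusMessage: run.statusMessage,
  };
}
```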

Issue 3: Classifier passes FailureReason as statusMessage

Even if issue 2 were fixed, the classifier currently duplicates the machine code into both LLM input fields.

process-classify-laboratory-run-failure.lambda.ts
Lines 124-130

Impact

The LLM receives the machine code twice and no human-readable context, which is the main reason to use an LLM for ambiguous HealthOmics failures.

Suggested fix

After issue 2, set statusMessage: run.FailureStatusMessage (or equivalent). Do not reuse FailureReason for both fields.

Issue 4: Seqera errorReport is never sent to the classifier

The backend status-check path only captures errorMessage; errorReport is never fetched or passed through.

Seqera snapshot only reads errorMessage:

process-update-laboratory-run.lambda.ts
Lines 333-337

The classifier only passes errorMessage (via FailureReason); errorReport is never set:

process-classify-laboratory-run-failure.lambda.ts
Lines 124-130

ClassificationInput supports errorReport, and the prompt uses it — but nothing in the pipeline populates it. The run detail page fetches errorReport client-side for display only ([labRunId].vue line 142); the async classifier never sees it.

Impact

Seqera LLM classification runs on errorMessage alone and misses the richer stack-trace context the PR describes.

Suggested fix

In getSeqeraCloudStatus, also return errorReport. Persist it on LaboratoryRun (new field, or pass through the SNS payload at classification time). In resolveClassification, set errorReport on ClassificationInput, truncated per MAX_FIELD_CHARS in classification-prompt.ts.
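
A hedged sketch of how the classifier input could be assembled once the fixes for issues 2 to 4 are in place. The `ClassificationInput` field names other than `statusMessage` and `errorReport`, the `FailureStatusMessage` and `FailureErrorReport` run fields, the platform strings, and the `MAX_FIELD_CHARS` value are all assumptions; the real limit lives in classification-prompt.ts.

```typescript
const MAX_FIELD_CHARS = 4000; // placeholder value for illustration only

type Platform = 'healthomics' | 'seqera'; // hypothetical identifiers

interface RunFailureFields {
  Platform: Platform;
  FailureReason?: string;
  FailureStatusMessage?: string; // proposed in issue 2
  FailureErrorReport?: string; // proposed here for Seqera
}

interface ClassificationInput {
  platform: Platform;
  failureReason?: string;
  statusMessage?: string;
  errorReport?: string;
}

// Caps a field so one long stack trace cannot dominate the prompt.
function truncateField(value: string | undefined): string | undefined {
  return value === undefined ? undefined : value.slice(0, MAX_FIELD_CHARS);
}

function buildClassificationInput(run: RunFailureFields): ClassificationInput {
  return {
    platform: run.Platform,
    failureReason: truncateField(run.FailureReason),
    // Use the dedicated message field instead of repeating FailureReason.
    statusMessage: truncateField(run.FailureStatusMessage),
    errorReport: truncateField(run.FailureErrorReport),
  };
}
```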

@diegomayorga-dept changed the title from "feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)" to "[WIP] feat(runs): surface and classify HealthOmics + Seqera run failures (BYOK per integration)" on Jun 4, 2026