feat: Evaluation Client — Lifecycle, Orchestration & Online Pipeline #393

@Hweinstock

Description

Problem

The SDK's EvaluationClient only exposes run(). On the control plane side, customers cannot programmatically create custom evaluators (LLM-as-a-judge configs), list available evaluators, update or delete evaluators, or manage online evaluation configs for continuous evaluation on live traffic; evaluator provisioning currently requires the console.

On the data plane side, the starter toolkit's EvaluationProcessor provides significantly richer orchestration than run(): it fetches session data from CloudWatch independently, groups evaluators by level (SESSION vs. TRACE), selects which spans to send based on evaluator level, and runs multiple evaluators with per-evaluator error handling. The toolkit also provides input validation, IAM role cleanup on delete, and typed config/result models.

Acceptance Criteria

  • Customers can create, get, list, update, and delete custom evaluators
  • Customers can create, get, list, update, and delete online evaluation configs
  • Online evaluation config supports enable/disable toggling and sampling rate adjustment
  • Typed result models with error introspection (has_error(), get_successful_results())
  • Customers can fetch session trace data (spans + runtime logs) from CloudWatch for a given session and agent
  • Customers can find the most recent session for an agent
  • Multi-evaluator orchestration groups evaluators by level and selects appropriate spans per level
  • Per-evaluator error handling — failures on one evaluator don't block others
  • Online evaluation config deletion supports optional IAM execution role cleanup
  • All functionality is verified via integration tests running in CI
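The control-plane criteria above imply a CRUD-style client surface. The sketch below is a self-contained, in-memory stand-in for that proposed surface, assuming method and field names (create_evaluator, create_online_config, set_online_config_enabled, sampling_rate, delete_execution_role) that are illustrative only and not part of the current SDK.

```python
import uuid
from dataclasses import dataclass


@dataclass
class OnlineEvalConfig:
    config_id: str
    evaluator_ids: list[str]
    enabled: bool = True
    sampling_rate: float = 1.0


class EvaluationClient:
    """In-memory stand-in for the proposed control-plane surface."""

    def __init__(self) -> None:
        self._evaluators: dict[str, dict] = {}
        self._configs: dict[str, OnlineEvalConfig] = {}

    # --- custom evaluator lifecycle ---
    def create_evaluator(self, name: str, judge_prompt: str) -> str:
        evaluator_id = str(uuid.uuid4())
        self._evaluators[evaluator_id] = {"name": name, "judge_prompt": judge_prompt}
        return evaluator_id

    def list_evaluators(self) -> list[str]:
        return list(self._evaluators)

    def update_evaluator(self, evaluator_id: str, **fields) -> None:
        self._evaluators[evaluator_id].update(fields)

    def delete_evaluator(self, evaluator_id: str) -> None:
        del self._evaluators[evaluator_id]

    # --- online evaluation config lifecycle ---
    def create_online_config(self, evaluator_ids: list[str],
                             sampling_rate: float = 1.0) -> OnlineEvalConfig:
        cfg = OnlineEvalConfig(str(uuid.uuid4()), list(evaluator_ids),
                               enabled=True, sampling_rate=sampling_rate)
        self._configs[cfg.config_id] = cfg
        return cfg

    def set_online_config_enabled(self, config_id: str, enabled: bool) -> None:
        self._configs[config_id].enabled = enabled

    def set_sampling_rate(self, config_id: str, rate: float) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sampling rate must be within [0.0, 1.0]")
        self._configs[config_id].sampling_rate = rate

    def delete_online_config(self, config_id: str,
                             delete_execution_role: bool = False) -> None:
        self._configs.pop(config_id)
        if delete_execution_role:
            pass  # real implementation would also tear down the IAM execution role
```

Under this shape, enable/disable toggling and sampling-rate adjustment are small mutations on an existing config, and IAM cleanup is an opt-in flag on delete, matching the criteria above.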

Relevant Links

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
