Jobs to run of Apps Codegen bulk gen + Evals #4103

keugenek · 2025-12-05T17:03:46Z

Changes

This PR creates a Lakeflow Job for long-running Apps Codegen Evals. The job clones this repo, generates sample Databricks apps using the current CLI MCP, runs evals on them, and publishes the results to MLflow.

Testing

evgenii.kniazev@FP424MF2FY cli % cd experimental/apps-mcp/evals
evgenii.kniazev@FP424MF2FY evals % databricks auth login
✔ Databricks profile name [DEFAULT]: █
Profile DEFAULT was successfully saved
evgenii.kniazev@FP424MF2FY evals % databricks bundle validate -t dev
Name: apps-mcp-evals
Target: dev
Workspace:..

Validation OK!
evgenii.kniazev@FP424MF2FY evals % databricks bundle deploy -t dev
Building apps_mcp_evals...
Uploading dist/apps_mcp_evals-0.1.0-py3-none-any.whl...
Uploading bundle files to /..
Deploying resources...
Updating deployment state...
Deployment complete!
evgenii.kniazev@FP424MF2FY evals % databricks bundle run -t dev apps_eval_job
Run URL: ...

2025-12-05 17:00:16 "[dev evgenii_kniazev] [dev] Apps-MCP Continuous Evals" RUNNING

- Change requires-python from >=3.11 to >=3.10
- Replace str | None union syntax with Optional[str] for 3.10 compat
- Remove unused databricks-sdk and tqdm dependencies

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
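To illustrate the annotation style this commit switches to, here is a minimal sketch. The function name `resolve_volume` and the default path are hypothetical and not taken from the PR; only the `Optional[str]` spelling reflects the commit message.

```python
from typing import Optional

# Hypothetical default, assumed from the volume name mentioned later in this PR.
DEFAULT_APPS_VOLUME = "/Volumes/main/default/apps_mcp_generated"


def resolve_volume(override: Optional[str] = None) -> str:
    """Return the override path when one is given, otherwise the default."""
    # Optional[str] is equivalent to "str or None" and needs only the typing module.
    return override if override else DEFAULT_APPS_VOLUME
```

Calling `resolve_volume()` falls back to the default, while `resolve_volume("/some/path")` returns the given path unchanged.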

- Remove bundle run dependency (databricks CLI not available in serverless)
- Clone appdotbuild-agent repo and install klaudbiusz deps
- Handle case of no apps gracefully - log sample metrics to MLflow
- Job successfully validates infrastructure and logs to MLflow

Note: Full eval requires Python 3.12+ or pre-populated apps

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
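The "no apps" handling described in this commit could look roughly like the sketch below. It is not the code from `run_evals.py`: the function names and metric keys are hypothetical, and the metric sink is passed in as a callable so the logic can run without MLflow installed. In a real job that callable could be `mlflow.log_metric`, which takes a key and a numeric value.

```python
from pathlib import Path
from typing import Callable, Dict, List


def find_apps(apps_dir: Path) -> List[Path]:
    """List app directories, returning an empty list if the root is missing."""
    if not apps_dir.is_dir():
        return []
    return sorted(p for p in apps_dir.iterdir() if p.is_dir())


def summarize(apps_dir: Path, log_metric: Callable[[str, float], None]) -> Dict[str, float]:
    """Log basic metrics even when there is nothing to evaluate."""
    apps = find_apps(apps_dir)
    metrics = {"apps_found": float(len(apps))}  # hypothetical metric name
    if not apps:
        # Record the skip instead of raising, so the job run still succeeds.
        metrics["eval_skipped"] = 1.0  # hypothetical metric name
    for key, value in metrics.items():
        log_metric(key, value)
    return metrics
```

Injecting the logger keeps the empty-directory path easy to exercise locally before wiring it to an MLflow run.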

- Add apps_generation_job.job.yml with single-node Docker cluster
- Add generate_apps.py orchestrator using klaudbiusz framework
- Add init/setup_generation.sh to install Dagger and Python deps
- Update run_evals.py to read apps from UC Volume
- Add variables for CLI binary and generated apps volumes

Generation uses databricks experimental apps-mcp as the MCP server, built from this repo for Linux x86_64.

Prerequisites:
- Create secret: databricks secrets put-secret apps-mcp-evals anthropic-api-key
- Upload CLI:
  GOOS=linux GOARCH=amd64 go build -o databricks-linux .
  databricks fs cp databricks-linux /Volumes/main/evals/artifacts/

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

Use main.default.apps_mcp_artifacts and main.default.apps_mcp_generated volumes which were created successfully. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

- Use LiteLLM backend (anthropic/claude-sonnet-4-20250514) to bypass Claude Agent SDK root user restriction on Databricks clusters
- Replace symlinks with latest.txt file (symlinks not supported on UC Volumes)
- Revert docker_image and data_security_mode changes (not needed with LiteLLM)
- Successfully tested: generated hello-world app at $2.33 cost

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
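The symlink replacement mentioned in this commit can be pictured as a small pointer file. The sketch below is an assumption about how such a `latest.txt` might be written and read; the helper names and the one-line file format are hypothetical and not confirmed by the PR.

```python
from pathlib import Path
from typing import Optional

POINTER_NAME = "latest.txt"  # file name taken from the commit message


def write_latest(root: Path, run_name: str) -> None:
    """Record the most recent run directory name in a plain text file."""
    (root / POINTER_NAME).write_text(run_name + "\n", encoding="utf-8")


def read_latest(root: Path) -> Optional[Path]:
    """Resolve the pointer to a run directory, or None if no pointer exists."""
    pointer = root / POINTER_NAME
    if not pointer.is_file():
        return None
    name = pointer.read_text(encoding="utf-8").strip()
    return root / name if name else None
```

A reader such as the eval job would call `read_latest` on the volume root and treat `None` as "nothing generated yet".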

- Change entry point from main to cli wrapper that uses fire.Fire()
- This enables proper CLI argument parsing for wheel package
- Now correctly receives apps_volume parameter from job config

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
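A minimal sketch of the wrapper pattern this commit describes, assuming the `fire` package is installed. The body of `main` and its default value are hypothetical; only the `cli` wrapper, `fire.Fire()`, and the `apps_volume` parameter come from the commit message.

```python
def main(apps_volume: str = "/Volumes/main/default/apps_mcp_generated") -> dict:
    """Hypothetical job body: echo the configuration it was started with."""
    config = {"apps_volume": apps_volume}
    print(f"Running evals with {config}")
    return config


def cli() -> None:
    """Entry point for the wheel: let Fire map command-line flags onto main()."""
    import fire  # third-party; imported here so main() stays usable without it

    fire.Fire(main)


# Run cli() only when executed as a script, never on import.
if __name__ == "__main__":
    cli()
```

With this shape, an argument such as `--apps_volume=/some/path` passed by the job config would be parsed by Fire and handed to `main` as a keyword argument, which a bare `main` entry point would not do on its own.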

…-nightly

…eval, simplifying the code and enhancing maintainability.

… job structure, prerequisites, and configuration details. Introduce Generation and Evaluation jobs, update quick start commands, and add prompt sets and known limitations sections.

keugenek · 2025-12-11T17:07:56Z

experimental/apps-mcp/evals/init/setup_eval.sh

+echo "Node version: $(node --version)"
+echo "npm version: $(npm --version)"
+
+# Install Docker (required for --no-dagger mode)


keugenek · 2025-12-11T17:08:06Z

experimental/apps-mcp/evals/init/setup_generation.sh

+echo "=== Setting up generation environment ==="
+
+# Install Dagger (required for klaudbiusz container orchestration)
+echo "Installing Dagger..."


…ment from setup_eval.sh and run_evals.py, simplifying the evaluation process. Update Node.js installation comment for clarity and adjust evaluation runner to use local execution mode.

…nction to execute app evaluations without Docker, enhancing the evaluation process. Update main function to utilize local mode and improve output messages for clarity.

arsenyinfo · 2025-12-17T10:41:08Z

experimental/apps-mcp/evals/resources/apps_eval_job.job.yml

+              ResourceClass: SingleNode
+            spark_env_vars:
+              DATABRICKS_HOST: ${workspace.host}
+              DATABRICKS_TOKEN: "{{secrets/apps-mcp-evals/databricks-token}}"


I don't think PAT is necessary here

Skeleton for nightly run of Apps Codegen Evals

990c46e

keugenek requested review from a team and lennartkats-db as code owners December 5, 2025 17:03

keugenek marked this pull request as draft December 5, 2025 17:04

keugenek temporarily deployed to test-trigger-is December 5, 2025 17:04 — with GitHub Actions Inactive

keugenek temporarily deployed to test-trigger-is December 5, 2025 17:06 — with GitHub Actions Inactive

Proper url for evals repo

8c1e665

keugenek requested review from fjakobs and igrekun December 5, 2025 17:23

keugenek temporarily deployed to test-trigger-is December 5, 2025 17:24 — with GitHub Actions Inactive

keugenek temporarily deployed to test-trigger-is December 8, 2025 10:39 — with GitHub Actions Inactive

Merge branch 'main' into exprimental/mcp-evals-nightly

4d58e99

keugenek temporarily deployed to test-trigger-is December 8, 2025 14:42 — with GitHub Actions Inactive

keugenek temporarily deployed to test-trigger-is December 8, 2025 14:53 — with GitHub Actions Inactive

Fix UC Volume paths for CLI binary and generated apps

ea20bd0

Use main.default.apps_mcp_artifacts and main.default.apps_mcp_generated volumes which were created successfully. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

keugenek temporarily deployed to test-trigger-is December 8, 2025 14:58 — with GitHub Actions Inactive

keugenek and others added 3 commits December 9, 2025 14:14

Merge remote-tracking branch 'origin/main' into exprimental/mcp-evals…

d4cfd37

…-nightly

keugenek temporarily deployed to test-trigger-is December 10, 2025 11:39 — with GitHub Actions Inactive

Merge branch 'main' into exprimental/mcp-evals-nightly

52536b5

keugenek temporarily deployed to test-trigger-is December 10, 2025 14:24 — with GitHub Actions Inactive

Required to bypass proc mount restrictions and AppArmor.

cf78b42

keugenek temporarily deployed to test-trigger-is December 11, 2025 16:18 — with GitHub Actions Inactive

keugenek added 2 commits December 11, 2025 16:23

Refactor get_prompts function to use external import for prompt retri…

3d5b131

…eval, simplifying the code and enhancing maintainability.

Update README.md for Apps-MCP Evals: Enhance documentation to clarify…

b08f318

… job structure, prerequisites, and configuration details. Introduce Generation and Evaluation jobs, update quick start commands, and add prompt sets and known limitations sections.

keugenek temporarily deployed to test-trigger-is December 11, 2025 16:27 — with GitHub Actions Inactive

keugenek marked this pull request as ready for review December 11, 2025 16:29

Merge branch 'main' into exprimental/mcp-evals-nightly

7c04633

keugenek changed the title ~~Skeleton for nightly run of Apps Codegen Evals~~ Jobs to run of Apps Codegen bulk gen + Evals Dec 11, 2025

keugenek temporarily deployed to test-trigger-is December 11, 2025 17:01 — with GitHub Actions Inactive

keugenek commented Dec 11, 2025

Refactor eval setup and runner: Remove Docker installation and manage…

b2cf28c

…ment from setup_eval.sh and run_evals.py, simplifying the evaluation process. Update Node.js installation comment for clarity and adjust evaluation runner to use local execution mode.

keugenek temporarily deployed to test-trigger-is December 11, 2025 17:34 — with GitHub Actions Inactive

Add local evaluation functionality: Implement run_local_evaluation fu…

6df3873

…nction to execute app evaluations without Docker, enhancing the evaluation process. Update main function to utilize local mode and improve output messages for clarity.

keugenek temporarily deployed to test-trigger-is December 12, 2025 10:27 — with GitHub Actions Inactive

arsenyinfo reviewed Dec 17, 2025
