@keugenek
@keugenek keugenek commented Dec 5, 2025

Changes

This PR creates a Lakeflow Job for long-running Apps Codegen Evals. The job clones this repo, generates sample Databricks apps using the current CLI MCP server, runs evals on them, and publishes the results to MLflow.

Testing

evgenii.kniazev@FP424MF2FY cli % cd experimental/apps-mcp/evals
evgenii.kniazev@FP424MF2FY evals % databricks auth login
✔ Databricks profile name [DEFAULT]: █
Profile DEFAULT was successfully saved
evgenii.kniazev@FP424MF2FY evals % databricks bundle validate -t dev
Name: apps-mcp-evals
Target: dev
Workspace:..

Validation OK!
evgenii.kniazev@FP424MF2FY evals % databricks bundle deploy -t dev
Building apps_mcp_evals...
Uploading dist/apps_mcp_evals-0.1.0-py3-none-any.whl...
Uploading bundle files to /..
Deploying resources...
Updating deployment state...
Deployment complete!
evgenii.kniazev@FP424MF2FY evals % databricks bundle run -t dev apps_eval_job
Run URL: ...

2025-12-05 17:00:16 "[dev evgenii_kniazev] [dev] Apps-MCP Continuous Evals" RUNNING

@keugenek keugenek requested review from a team and lennartkats-db as code owners December 5, 2025 17:03
@keugenek keugenek marked this pull request as draft December 5, 2025 17:04
- Change requires-python from >=3.11 to >=3.10
- Replace str | None union syntax with Optional[str] for 3.10 compat
- Remove unused databricks-sdk and tqdm dependencies

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
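
For reviewers, a minimal sketch of the annotation style this commit describes. The function name `resolve_profile` and its default value are hypothetical and are not taken from the PR. `Optional[str]` means the same thing as `str | None` and also evaluates on interpreters that predate the `|` union syntax.

```python
from typing import Optional


def resolve_profile(profile: Optional[str] = None) -> str:
    """Return the given profile name, or "DEFAULT" when none is supplied.

    Hypothetical helper: it only illustrates the Optional[str] annotation.
    """
    return profile if profile is not None else "DEFAULT"
```
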
@keugenek keugenek requested review from fjakobs and igrekun December 5, 2025 17:23
- Remove bundle run dependency (databricks CLI not available in serverless)
- Clone appdotbuild-agent repo and install klaudbiusz deps
- Handle case of no apps gracefully - log sample metrics to MLflow
- Job successfully validates infrastructure and logs to MLflow

Note: Full eval requires Python 3.12+ or pre-populated apps

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add apps_generation_job.job.yml with single-node Docker cluster
- Add generate_apps.py orchestrator using klaudbiusz framework
- Add init/setup_generation.sh to install Dagger and Python deps
- Update run_evals.py to read apps from UC Volume
- Add variables for CLI binary and generated apps volumes

Generation uses databricks experimental apps-mcp as the MCP server,
built from this repo for Linux x86_64.

Prerequisites:
- Create secret: databricks secrets put-secret apps-mcp-evals anthropic-api-key
- Upload CLI: GOOS=linux GOARCH=amd64 go build -o databricks-linux .
             databricks fs cp databricks-linux /Volumes/main/evals/artifacts/

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Use main.default.apps_mcp_artifacts and main.default.apps_mcp_generated
volumes which were created successfully.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
keugenek and others added 3 commits December 9, 2025 14:14
- Use LiteLLM backend (anthropic/claude-sonnet-4-20250514) to bypass
  Claude Agent SDK root user restriction on Databricks clusters
- Replace symlinks with latest.txt file (symlinks not supported on UC Volumes)
- Revert docker_image and data_security_mode changes (not needed with LiteLLM)
- Successfully tested: generated hello-world app at $2.33 cost

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
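
A rough sketch of the `latest.txt` pointer idea from the commit above, assuming the file simply holds the name of the most recent run directory. The helper names `write_latest` and `read_latest` and the one-line file layout are assumptions made for illustration; the PR's actual implementation may differ.

```python
from pathlib import Path
from typing import Optional

POINTER_NAME = "latest.txt"  # file name taken from the commit message


def write_latest(base_dir: Path, run_name: str) -> Path:
    """Record run_name as the latest run by writing it to latest.txt.

    A plain file is used instead of a symlink, since the commit notes
    that symlinks are not supported on UC Volumes.
    """
    pointer = base_dir / POINTER_NAME
    pointer.write_text(run_name + "\n", encoding="utf-8")
    return pointer


def read_latest(base_dir: Path) -> Optional[Path]:
    """Return the directory latest.txt points at, or None if it is unset."""
    pointer = base_dir / POINTER_NAME
    if not pointer.exists():
        return None
    run_name = pointer.read_text(encoding="utf-8").strip()
    return base_dir / run_name if run_name else None
```

A consumer such as the eval job would call `read_latest(...)` on the volume root and treat `None` as "no apps generated yet".
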
- Change entry point from main to cli wrapper that uses fire.Fire()
- This enables proper CLI argument parsing for wheel package
- Now correctly receives apps_volume parameter from job config

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
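
A minimal sketch of the wrapper pattern this commit describes, assuming the `fire` package is installed. The body of `main` and the default volume path are placeholders for illustration, not the PR's code. With `cli` registered as the wheel's entry point, `fire.Fire(main)` maps a job parameter such as `--apps_volume=...` onto the keyword argument of the same name.

```python
def main(apps_volume: str = "/Volumes/main/default/apps_mcp_generated") -> str:
    """Placeholder eval entry: report which volume the apps are read from.

    The default path is an assumption based on the volume names mentioned
    in this PR; the real function runs the evaluation.
    """
    return f"reading apps from {apps_volume}"


def cli() -> None:
    """Entry-point wrapper: let fire parse command-line flags for main()."""
    import fire  # imported lazily so main() stays importable without fire

    fire.Fire(main)
```

Pointing the entry point at `main` directly would call it with no arguments, which would explain why `apps_volume` was not received before this change.
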
…eval, simplifying the code and enhancing maintainability.
… job structure, prerequisites, and configuration details. Introduce Generation and Evaluation jobs, update quick start commands, and add prompt sets and known limitations sections.
@keugenek keugenek marked this pull request as ready for review December 11, 2025 16:29
@keugenek keugenek changed the title Skeleton for nightly run of Apps Codegen Evals Jobs to run of Apps Codegen bulk gen + Evals Dec 11, 2025
echo "Node version: $(node --version)"
echo "npm version: $(npm --version)"

# Install Docker (required for --no-dagger mode)

remove

echo "=== Setting up generation environment ==="

# Install Dagger (required for klaudbiusz container orchestration)
echo "Installing Dagger..."

remove

…ment from setup_eval.sh and run_evals.py, simplifying the evaluation process. Update Node.js installation comment for clarity and adjust evaluation runner to use local execution mode.
…nction to execute app evaluations without Docker, enhancing the evaluation process. Update main function to utilize local mode and improve output messages for clarity.
ResourceClass: SingleNode
spark_env_vars:
  DATABRICKS_HOST: ${workspace.host}
  DATABRICKS_TOKEN: "{{secrets/apps-mcp-evals/databricks-token}}"

I don't think a PAT is necessary here
