Jobs to run of Apps Codegen bulk gen + Evals #4103
base: main
- Change requires-python from >=3.11 to >=3.10
- Replace str | None union syntax with Optional[str] for 3.10 compat
- Remove unused databricks-sdk and tqdm dependencies
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Remove bundle run dependency (databricks CLI not available in serverless)
- Clone appdotbuild-agent repo and install klaudbiusz deps
- Handle case of no apps gracefully - log sample metrics to MLflow
- Job successfully validates infrastructure and logs to MLflow
Note: Full eval requires Python 3.12+ or pre-populated apps
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
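The "no apps" handling mentioned in this commit could look roughly like the sketch below. It is an illustration only, not the code from run_evals.py: the function names `find_app_dirs` and `summarize_apps` and the metric keys are hypothetical, and it assumes each generated app lives in its own subdirectory. The returned dict is shaped so it could be passed to `mlflow.log_metrics`, which is left out here to keep the example standard-library only.

```python
from pathlib import Path
from typing import Dict, List


def find_app_dirs(apps_root: Path) -> List[Path]:
    """Return app subdirectories, or an empty list if the root is missing."""
    if not apps_root.is_dir():
        return []
    return sorted(p for p in apps_root.iterdir() if p.is_dir())


def summarize_apps(apps_root: Path) -> Dict[str, float]:
    """Build a metrics dict; fall back to sample values when no apps exist.

    The metric names are hypothetical placeholders for this sketch.
    """
    apps = find_app_dirs(apps_root)
    if not apps:
        # Nothing to evaluate: report a sample payload instead of failing,
        # so the job can still validate the infrastructure end to end.
        return {"apps_found": 0.0, "is_sample": 1.0}
    return {"apps_found": float(len(apps)), "is_sample": 0.0}
```

Returning a flagged sample payload instead of raising lets the run finish and still exercise the logging path, which matches the "validates infrastructure" goal stated in the commit.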
- Add apps_generation_job.job.yml with single-node Docker cluster
- Add generate_apps.py orchestrator using klaudbiusz framework
- Add init/setup_generation.sh to install Dagger and Python deps
- Update run_evals.py to read apps from UC Volume
- Add variables for CLI binary and generated apps volumes
Generation uses databricks experimental apps-mcp as the MCP server,
built from this repo for Linux x86_64.
Prerequisites:
- Create secret: databricks secrets put-secret apps-mcp-evals anthropic-api-key
- Upload CLI: GOOS=linux GOARCH=amd64 go build -o databricks-linux .
databricks fs cp databricks-linux /Volumes/main/evals/artifacts/
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Use main.default.apps_mcp_artifacts and main.default.apps_mcp_generated volumes which were created successfully.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Use LiteLLM backend (anthropic/claude-sonnet-4-20250514) to bypass Claude Agent SDK root user restriction on Databricks clusters
- Replace symlinks with latest.txt file (symlinks not supported on UC Volumes)
- Revert docker_image and data_security_mode changes (not needed with LiteLLM)
- Successfully tested: generated hello-world app at $2.33 cost
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
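A pointer file can stand in for a "latest" symlink along these lines. This is a minimal sketch under assumptions, not the PR's implementation: the helper names `write_latest_pointer` and `resolve_latest_run` are hypothetical, and it assumes latest.txt holds the name of the most recent run directory relative to the volume root.

```python
from pathlib import Path
from typing import Optional

POINTER_NAME = "latest.txt"  # file name taken from the commit message


def write_latest_pointer(volume_root: Path, run_dir_name: str) -> Path:
    """Record the most recent run directory name in latest.txt."""
    pointer = volume_root / POINTER_NAME
    pointer.write_text(run_dir_name + "\n", encoding="utf-8")
    return pointer


def resolve_latest_run(volume_root: Path) -> Optional[Path]:
    """Return the directory named in latest.txt, or None if unset."""
    pointer = volume_root / POINTER_NAME
    if not pointer.exists():
        return None
    name = pointer.read_text(encoding="utf-8").strip()
    if not name:
        return None
    return volume_root / name
```

A reader such as the eval job would call `resolve_latest_run` on the generated-apps volume path and treat `None` as "no generation run yet".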
- Change entry point from main to cli wrapper that uses fire.Fire()
- This enables proper CLI argument parsing for wheel package
- Now correctly receives apps_volume parameter from job config
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…eval, simplifying the code and enhancing maintainability.
… job structure, prerequisites, and configuration details. Introduce Generation and Evaluation jobs, update quick start commands, and add prompt sets and known limitations sections.
echo "Node version: $(node --version)"
echo "npm version: $(npm --version)"

# Install Docker (required for --no-dagger mode)
remove
echo "=== Setting up generation environment ==="

# Install Dagger (required for klaudbiusz container orchestration)
echo "Installing Dagger..."
remove
…ment from setup_eval.sh and run_evals.py, simplifying the evaluation process. Update Node.js installation comment for clarity and adjust evaluation runner to use local execution mode.
…nction to execute app evaluations without Docker, enhancing the evaluation process. Update main function to utilize local mode and improve output messages for clarity.
ResourceClass: SingleNode
spark_env_vars:
  DATABRICKS_HOST: ${workspace.host}
  DATABRICKS_TOKEN: "{{secrets/apps-mcp-evals/databricks-token}}"
I don't think PAT is necessary here
Changes
This PR creates a Lakeflow Job for long-running Apps Codegen evals. The job clones this repo, runs a sample generation of Databricks apps using the current CLI MCP server, runs evals on the generated apps, and publishes the results to MLflow.
Testing
evgenii.kniazev@FP424MF2FY cli % cd experimental/apps-mcp/evals
evgenii.kniazev@FP424MF2FY evals % databricks auth login
✔ Databricks profile name [DEFAULT]: █
Profile DEFAULT was successfully saved
evgenii.kniazev@FP424MF2FY evals % databricks bundle validate -t dev
Name: apps-mcp-evals
Target: dev
Workspace:..
Validation OK!
evgenii.kniazev@FP424MF2FY evals % databricks bundle deploy -t dev
Building apps_mcp_evals...
Uploading dist/apps_mcp_evals-0.1.0-py3-none-any.whl...
Uploading bundle files to /..
Deploying resources...
Updating deployment state...
Deployment complete!
evgenii.kniazev@FP424MF2FY evals % databricks bundle run -t dev apps_eval_job
Run URL: ...
2025-12-05 17:00:16 "[dev evgenii_kniazev] [dev] Apps-MCP Continuous Evals" RUNNING