for more information, see https://pre-commit.ci
Pull request overview
Adds experiment runner scripts and prompt templates to evaluate CoT/GIM configurations across offline (vLLM) and API (OpenRouter) model backends for KDD experiment runs.
Changes:
- Added bash scripts to run MedMCQA/QASC (and some GPQA) evaluations with/without GIM prompting and varying reasoning budgets.
- Added auto_budget_prompt.txt templates to drive automatic budget selection.
- Duplicated “fix-error” result directories containing the same runner scripts and prompts.
Reviewed changes
Copilot reviewed 11 out of 273 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| results/260202-kdd-expt-2-full/*eval_gim.sh | Adds offline vLLM experiment runner for MedMCQA/QASC with GIM budgets. |
| results/260202-kdd-expt-2-full/*eval_api.sh | Adds OpenRouter/OpenAI-compatible API experiment runner with GIM budgets. |
| results/260202-kdd-expt-2-full-fix-error/auto_budget_prompt.txt | Adds auto-budget prompt template (example included). |
| results/260202-kdd-expt-2-full-fix-error/*eval_gim.sh | Same offline vLLM runner, placed in fix-error directory. |
| results/260202-kdd-expt-2-full-fix-error/*eval_api.sh | Same API runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full/eval_gim.sh | Adds parameterized offline vLLM runner across datasets/models. |
| results/260129-kdd-expt-2-full/eval_api.sh | Adds API runner (GPQA/QASC calls) with fixed budgets. |
| results/260129-kdd-expt-2-full/auto_budget_prompt.txt | Adds auto-budget prompt template (example included). |
| results/260129-kdd-expt-2-full-fix-error/eval_gim.sh | Same offline runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full-fix-error/eval_api.sh | Same API runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full-fix-error/auto_budget_prompt.txt | Same auto-budget prompt template, placed in fix-error directory. |
    echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
    export AUTO_BUDGET_PROMPT=""
The script still passes --auto_budget_prompt "$AUTO_BUDGET_PROMPT" when auto_budget_prompt.txt is missing, so the run gets an empty prompt, which will likely change or degrade auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory as the other result folders do.
Suggested change:

    - echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
    - export AUTO_BUDGET_PROMPT=""
    + echo "Error: auto_budget_prompt.txt not found in $script_dir" >&2
    + exit 1
    python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type cfg --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
        --auto_budget --auto_budget_prompt "$AUTO_BUDGET_PROMPT" \
The script still passes --auto_budget_prompt "$AUTO_BUDGET_PROMPT" when auto_budget_prompt.txt is missing, so the run gets an empty prompt, which will likely change or degrade auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory as the other result folders do.
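A minimal sketch of the fail-fast behavior described in this comment, assuming the prompt file sits next to the script as auto_budget_prompt.txt. The helper name `load_auto_budget_prompt` is hypothetical and does not come from the PR; it also treats an empty file as an error, which is an extra assumption.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): load the auto-budget prompt,
# or report an error and return non-zero if the file is missing or empty.
load_auto_budget_prompt() {
    local dir="$1"
    local prompt_file="$dir/auto_budget_prompt.txt"
    if [ ! -s "$prompt_file" ]; then
        echo "Error: auto_budget_prompt.txt missing or empty in $dir" >&2
        return 1
    fi
    AUTO_BUDGET_PROMPT="$(cat "$prompt_file")"
    export AUTO_BUDGET_PROMPT
}
```

A caller would run `load_auto_budget_prompt "$script_dir" || exit 1` before any command that passes --auto_budget, so a missing prompt stops the experiment instead of silently altering it.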
    set -x

    export API_KEY=your_api_key_here
set -x echoes expanded command lines, including --api_key "$API_KEY", so the key can leak into logs and CI output. Also, the hardcoded export API_KEY=your_api_key_here placeholder invites editing a real key into the file and committing it. Prefer requiring API_KEY to be provided from the environment (exiting with a clear message if it is missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if gimbench supports one).
Suggested change:

    - set -x
    - export API_KEY=your_api_key_here
    + # Require API_KEY to be provided via the environment; do not hardcode secrets here.
    + if [ -z "${API_KEY:-}" ]; then
    +     echo "Error: API_KEY environment variable is not set. Please export API_KEY before running this script." >&2
    +     exit 1
    + fi
    python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
        --api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500
set -x echoes expanded command lines, including --api_key "$API_KEY", so the key can leak into logs and CI output. Also, the hardcoded export API_KEY=your_api_key_here placeholder invites editing a real key into the file and committing it. Prefer requiring API_KEY to be provided from the environment (exiting with a clear message if it is missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if gimbench supports one).
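One way to disable tracing around the API calls is sketched below. The wrapper name `call_with_api_key` is hypothetical and not part of the PR; it assumes the wrapped command accepts --api_key as a trailing argument. The key is expanded only after xtrace has been switched off, and tracing is restored afterwards if it was on.

```shell
#!/bin/bash
# Hypothetical wrapper (not from the PR): append --api_key to a command
# with xtrace disabled, so the key never appears in trace output.
call_with_api_key() {
    local was_tracing=0 status=0
    case "$-" in *x*) was_tracing=1 ;; esac
    # Turn tracing off; the redirect hides the trace of "set +x" itself.
    { set +x; } 2>/dev/null
    "$@" --api_key "$API_KEY" || status=$?
    # Restore tracing only if the caller had it enabled.
    if [ "$was_tracing" = "1" ]; then set -x; fi
    return "$status"
}
```

The call site then reads `call_with_api_key python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model"`, and the traced line contains no secret. This only protects trace output: arguments passed on the command line can still be visible to other processes on the same host, which is why an environment-based mechanism would be preferable if gimbench offers one.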
    shutdown -h +3
Unconditionally scheduling a host shutdown from a repo script is risky (surprising side effects for local runs, shared machines, CI, or remote sessions). Make shutdown opt-in (e.g., behind an environment flag), or remove it and document the intended operational workflow separately.
Suggested change:

    - shutdown -h +3
    + # Optional host shutdown: enable by setting ALLOW_HOST_SHUTDOWN=1 in the environment.
    + if [ "${ALLOW_HOST_SHUTDOWN:-0}" = "1" ]; then
    +     echo "ALLOW_HOST_SHUTDOWN=1 set; scheduling host shutdown in 3 minutes..."
    +     shutdown -h +3
    + else
    +     echo "Skipping host shutdown. To enable, set ALLOW_HOST_SHUTDOWN=1 in the environment before running this script." >&2
    + fi
    setup_prompt
    run_api_experiments

    shutdown -h +3
Same concern as the offline runner: an unconditional shutdown is a hazardous default in a checked-in script. Gate it behind an explicit flag or drop it entirely to prevent accidental machine shutdowns.
Suggested change:

    - shutdown -h +3
    + if [ "${ENABLE_EXPERIMENT_SHUTDOWN:-0}" = "1" ]; then
    +     shutdown -h +3
    + fi
    python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
        --api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500
    python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type json --model_type openai \
        --model_name "$model" --api_key "$API_KEY" --base_url "$API_BASE" \
The baseline run omits --output_type while the GIM run forces --output_type json. If downstream analysis expects a consistent artifact format, this will produce incompatible outputs between baseline vs GIM runs. Consider explicitly setting the same --output_type for both baseline and prompted runs (or document why the formats intentionally differ).
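A rough sketch of keeping the format consistent by defining it once. The `OUTPUT_TYPE` variable and the `print_api_runs` helper are hypothetical; the helper only prints the command lines so the flags can be inspected before anything runs. Whether the baseline (--no_gimkit) mode accepts --output_type is an assumption that should be checked against gimbench.

```shell
#!/bin/bash
# Assumption: the analysis pipeline expects one format for every run.
OUTPUT_TYPE="${OUTPUT_TYPE:-json}"

# Hypothetical helper (not from the PR): print the baseline and GIM
# command lines for one model, both using the same --output_type.
print_api_runs() {
    local model="$1"
    echo python -m gimbench.mcqa.medmcqa --output_type "$OUTPUT_TYPE" \
        --model_type openai --model_name "$model" --no_gimkit
    echo python -m gimbench.mcqa.medmcqa --output_type "$OUTPUT_TYPE" \
        --model_type openai --model_name "$model" --use_gim_prompt
}
```

Because both runs read the same variable, switching the whole experiment to another format is a one-line change rather than an edit per command.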
    run_gim_experiments() {
        python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
Similar to the API script, the baseline run doesn’t specify --output_type while the GIM runs set --output_type cfg. This can make result aggregation brittle if tools assume a single format. Recommend setting --output_type cfg on the baseline as well (or keeping both in the same format used by your evaluation pipeline).
Suggested change:

    - python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
    + python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" --output_type cfg \
    @@ -0,0 +1,54 @@
    #!/bin/bash
The filename shown in the PR uses a literal * (*eval_gim.sh). Asterisk characters in filenames are error-prone because they interact badly with shell globbing and are not portable across all environments/tools. Consider renaming these files to a concrete name (e.g., eval_gim.sh) and, if you need a glob, keep it in documentation or a wrapper script rather than in the tracked filename.
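To confirm whether any file really has an asterisk in its name, a small check along these lines could help. The function name `find_star_names` is hypothetical; the bracket expression `[*]` makes find match a literal asterisk rather than treating it as a wildcard.

```shell
#!/bin/bash
# Hypothetical check (not from the PR): list regular files under a
# directory whose names contain a literal asterisk.
find_star_names() {
    find "$1" -type f -name '*[*]*' -print
}
```

If the asterisk in the summary table is only shorthand for several files, `find_star_names results` prints nothing. If it does print a path, quote it when renaming so the shell does not expand it, for example `mv -- './*eval_gim.sh' eval_gim.sh`.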
No description provided.