feat: eval cot results #75

Merged
Ki-Seki merged 13 commits into feat/eval-results from feat/eval-cot
Feb 10, 2026
@Duguce Duguce commented Feb 2, 2026

No description provided.

@Ki-Seki Ki-Seki changed the base branch from main to feat/eval-results February 10, 2026 10:53
@Ki-Seki Ki-Seki marked this pull request as ready for review February 10, 2026 10:56
Copilot AI review requested due to automatic review settings February 10, 2026 10:56
@Ki-Seki Ki-Seki merged commit 03999d3 into feat/eval-results Feb 10, 2026
1 of 2 checks passed
@Ki-Seki Ki-Seki deleted the feat/eval-cot branch February 10, 2026 10:57
Copilot AI left a comment

Pull request overview

Adds experiment runner scripts and prompt templates to evaluate CoT/GIM configurations across offline (vLLM) and API (OpenRouter) model backends for KDD experiment runs.

Changes:

  • Added bash scripts to run MedMCQA/QASC (and some GPQA) evaluations with/without GIM prompting and varying reasoning budgets.
  • Added auto_budget_prompt.txt templates to drive automatic budget selection.
  • Duplicated “fix-error” result directories containing the same runner scripts and prompts.

Reviewed changes

Copilot reviewed 11 out of 273 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| results/260202-kdd-expt-2-full/*eval_gim.sh | Adds offline vLLM experiment runner for MedMCQA/QASC with GIM budgets. |
| results/260202-kdd-expt-2-full/*eval_api.sh | Adds OpenRouter/OpenAI-compatible API experiment runner with GIM budgets. |
| results/260202-kdd-expt-2-full-fix-error/auto_budget_prompt.txt | Adds auto-budget prompt template (example included). |
| results/260202-kdd-expt-2-full-fix-error/*eval_gim.sh | Same offline vLLM runner, placed in fix-error directory. |
| results/260202-kdd-expt-2-full-fix-error/*eval_api.sh | Same API runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full/eval_gim.sh | Adds parameterized offline vLLM runner across datasets/models. |
| results/260129-kdd-expt-2-full/eval_api.sh | Adds API runner (GPQA/QASC calls) with fixed budgets. |
| results/260129-kdd-expt-2-full/auto_budget_prompt.txt | Adds auto-budget prompt template (example included). |
| results/260129-kdd-expt-2-full-fix-error/eval_gim.sh | Same offline runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full-fix-error/eval_api.sh | Same API runner, placed in fix-error directory. |
| results/260129-kdd-expt-2-full-fix-error/auto_budget_prompt.txt | Same auto-budget prompt template, placed in fix-error directory. |


Comment on lines +10 to +11
echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
export AUTO_BUDGET_PROMPT=""
Copilot AI Feb 10, 2026

The script proceeds with --auto_budget_prompt "$AUTO_BUDGET_PROMPT" even when auto_budget_prompt.txt is missing, which will pass an empty prompt and likely change/degenerate auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory like the other result folders do.

Suggested change
- echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
- export AUTO_BUDGET_PROMPT=""
+ echo "Error: auto_budget_prompt.txt not found in $script_dir" >&2
+ exit 1

Comment on lines +21 to +22
python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type cfg --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
--auto_budget --auto_budget_prompt "$AUTO_BUDGET_PROMPT" \
Copilot AI Feb 10, 2026

The script proceeds with --auto_budget_prompt "$AUTO_BUDGET_PROMPT" even when auto_budget_prompt.txt is missing, which will pass an empty prompt and likely change/degenerate auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory like the other result folders do.
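
The fail-fast behavior described in this comment could look roughly like the sketch below. The function name load_auto_budget_prompt and its directory argument are hypothetical and are not taken from the PR; the sketch only assumes the prompt is stored as auto_budget_prompt.txt in the given directory.

```shell
# Hypothetical helper: load auto_budget_prompt.txt from a directory and
# return non-zero instead of silently exporting an empty prompt.
load_auto_budget_prompt() {
    local dir="$1"
    local prompt_file="$dir/auto_budget_prompt.txt"

    # -s is true only if the file exists and is non-empty.
    if [ ! -s "$prompt_file" ]; then
        echo "Error: auto_budget_prompt.txt not found or empty in $dir" >&2
        return 1
    fi

    AUTO_BUDGET_PROMPT="$(cat "$prompt_file")"
    export AUTO_BUDGET_PROMPT
}
```

A runner would call it as `load_auto_budget_prompt "$script_dir" || exit 1` before any command that passes --auto_budget, so a missing prompt stops the run rather than changing its behavior.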

Comment on lines +2 to +4
set -x

export API_KEY=your_api_key_here
Copilot AI Feb 10, 2026

set -x will echo expanded command lines, which includes --api_key "$API_KEY" and can leak keys into logs/CI output/shell history. Also, hardcoding export API_KEY=your_api_key_here encourages committing secrets-by-edit. Prefer requiring API_KEY to be provided from the environment (and exit with a clear message if missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if supported by gimbench).

Suggested change
- set -x
- export API_KEY=your_api_key_here
+ # Require API_KEY to be provided via the environment; do not hardcode secrets here.
+ if [ -z "${API_KEY:-}" ]; then
+     echo "Error: API_KEY environment variable is not set. Please export API_KEY before running this script." >&2
+     exit 1
+ fi

Comment on lines +33 to +34
python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
--api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500
Copilot AI Feb 10, 2026

set -x will echo expanded command lines, which includes --api_key "$API_KEY" and can leak keys into logs/CI output/shell history. Also, hardcoding export API_KEY=your_api_key_here encourages committing secrets-by-edit. Prefer requiring API_KEY to be provided from the environment (and exit with a clear message if missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if supported by gimbench).
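
One way to "disable tracing around the API calls" is sketched below. The wrapper name run_with_api_key is hypothetical. The caller passes everything except the key, so the traced call line never contains the expanded secret; the wrapper appends --api_key itself after turning tracing off. This is an illustration under those assumptions, not the script's actual structure.

```shell
# Hypothetical wrapper: run a command with --api_key appended while xtrace
# is off, then restore the previous tracing state.
run_with_api_key() {
    local had_xtrace=0
    case "$-" in *x*) had_xtrace=1 ;; esac
    { set +x; } 2>/dev/null      # turn tracing off without echoing this line

    local status=0
    if [ -z "${API_KEY:-}" ]; then
        echo "Error: API_KEY environment variable is not set." >&2
        status=1
    else
        "$@" --api_key "$API_KEY" || status=$?
    fi

    # Re-enable tracing only if the caller had it on.
    if [ "$had_xtrace" -eq 1 ]; then set -x; fi
    return "$status"
}
```

A call would look like `run_with_api_key python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500`. The key is still visible in the process list while the command runs, which is why the comment also suggests a config or environment mechanism if gimbench supports one.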

Comment on lines +53 to +54

shutdown -h +3
Copilot AI Feb 10, 2026

Unconditionally scheduling a host shutdown from a repo script is risky (surprising side effects for local runs, shared machines, CI, or remote sessions). Make shutdown opt-in (e.g., behind an environment flag), or remove it and document the intended operational workflow separately.

Suggested change
- shutdown -h +3
+ # Optional host shutdown: enable by setting ALLOW_HOST_SHUTDOWN=1 in the environment.
+ if [ "${ALLOW_HOST_SHUTDOWN:-0}" = "1" ]; then
+     echo "ALLOW_HOST_SHUTDOWN=1 set; scheduling host shutdown in 3 minutes..."
+     shutdown -h +3
+ else
+     echo "Skipping host shutdown. To enable, set ALLOW_HOST_SHUTDOWN=1 in the environment before running this script." >&2
+ fi

setup_prompt
run_api_experiments

shutdown -h +3
Copilot AI Feb 10, 2026

Same concern as the offline runner: an unconditional shutdown is a hazardous default in a checked-in script. Gate it behind an explicit flag or drop it entirely to prevent accidental machine shutdowns.

Suggested change
- shutdown -h +3
+ if [ "${ENABLE_EXPERIMENT_SHUTDOWN:-0}" = "1" ]; then
+     shutdown -h +3
+ fi

Comment on lines +33 to +36
python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
--api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500
python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type json --model_type openai \
--model_name "$model" --api_key "$API_KEY" --base_url "$API_BASE" \
Copilot AI Feb 10, 2026

The baseline run omits --output_type while the GIM run forces --output_type json. If downstream analysis expects a consistent artifact format, this will produce incompatible outputs between baseline vs GIM runs. Consider explicitly setting the same --output_type for both baseline and prompted runs (or document why the formats intentionally differ).
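
A sketch of keeping the format consistent: both runs take --output_type from one shared variable. The helper name build_medmcqa_args and the OUTPUT_TYPE variable are hypothetical, and whether the baseline (--no_gimkit) mode accepts the same --output_type value is an assumption the source does not confirm.

```shell
# Assumption: baseline and GIM runs accept the same --output_type value.
OUTPUT_TYPE="${OUTPUT_TYPE:-json}"

# Hypothetical helper: print the shared argument list, one argument per line,
# so the two runs differ only in their mode-specific flag.
build_medmcqa_args() {
    local mode="$1" model="$2"
    local args=(--model_type openai --model_name "$model" --output_type "$OUTPUT_TYPE")

    case "$mode" in
        baseline) args+=(--no_gimkit) ;;
        gim)      args+=(--use_gim_prompt) ;;
        *)        echo "Error: unknown mode '$mode'" >&2; return 1 ;;
    esac

    printf '%s\n' "${args[@]}"
}
```

A runner could read the list back with `mapfile -t args < <(build_medmcqa_args baseline "$model")` and pass `"${args[@]}"` to `python -m "gimbench.mcqa.medmcqa"`, which keeps the format choice in a single place.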



run_gim_experiments() {
python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
Copilot AI Feb 10, 2026

Similar to the API script, the baseline run doesn’t specify --output_type while the GIM runs set --output_type cfg. This can make result aggregation brittle if tools assume a single format. Recommend setting --output_type cfg on the baseline as well (or keeping both in the same format used by your evaluation pipeline).

Suggested change
- python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
+ python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" --output_type cfg \

@@ -0,0 +1,54 @@
#!/bin/bash
Copilot AI Feb 10, 2026

The filename shown in the PR uses a literal * (*eval_gim.sh). Asterisk characters in filenames are error-prone because they interact badly with shell globbing and are not portable across all environments/tools. Consider renaming these files to a concrete name (e.g., eval_gim.sh) and, if you need a glob, keep it in documentation or a wrapper script rather than in the tracked filename.
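
If a tracked filename really does contain an asterisk, the globbing hazard can be reproduced in a throwaway directory. The file run_eval_gim.sh below is a hypothetical neighbor created only for the demonstration.

```shell
# Demonstration: a file literally named "*eval_gim.sh" next to a normally
# named script, both created in a temporary directory.
demo_dir="$(mktemp -d)"
touch "$demo_dir/*eval_gim.sh" "$demo_dir/run_eval_gim.sh"

# Unquoted, the pattern expands to every matching file, including the
# literal-asterisk one, so a command meant for one file receives two.
unquoted_matches=( "$demo_dir"/*eval_gim.sh )

# Quoted, the asterisk is taken literally and names exactly one file.
quoted_matches=( "$demo_dir/*eval_gim.sh" )

echo "unquoted: ${#unquoted_matches[@]} file(s)"   # unquoted: 2 file(s)
echo "quoted:   ${#quoted_matches[@]} file(s)"     # quoted:   1 file(s)

rm -rf "$demo_dir"
```

Any command that forgets the quotes, such as `bash *eval_gim.sh`, would pick up both files, which is the portability problem the comment describes.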
