feat: eval cot results by Duguce · Pull Request #75 · SculptAI/GIMBench

Duguce · 2026-02-02T08:10:29Z

No description provided.

Copilot

Pull request overview

Adds experiment runner scripts and prompt templates to evaluate CoT/GIM configurations across offline (vLLM) and API (OpenRouter) model backends for KDD experiment runs.

Changes:

Added bash scripts to run MedMCQA/QASC (and some GPQA) evaluations with/without GIM prompting and varying reasoning budgets.
Added auto_budget_prompt.txt templates to drive automatic budget selection.
Duplicated “fix-error” result directories containing the same runner scripts and prompts.

Reviewed changes

Copilot reviewed 11 out of 273 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
results/260202-kdd-expt-2-full/*eval_gim.sh	Adds offline vLLM experiment runner for MedMCQA/QASC with GIM budgets.
results/260202-kdd-expt-2-full/*eval_api.sh	Adds OpenRouter/OpenAI-compatible API experiment runner with GIM budgets.
results/260202-kdd-expt-2-full-fix-error/auto_budget_prompt.txt	Adds auto-budget prompt template (example included).
results/260202-kdd-expt-2-full-fix-error/*eval_gim.sh	Same offline vLLM runner, placed in fix-error directory.
results/260202-kdd-expt-2-full-fix-error/*eval_api.sh	Same API runner, placed in fix-error directory.
results/260129-kdd-expt-2-full/eval_gim.sh	Adds parameterized offline vLLM runner across datasets/models.
results/260129-kdd-expt-2-full/eval_api.sh	Adds API runner (GPQA/QASC calls) with fixed budgets.
results/260129-kdd-expt-2-full/auto_budget_prompt.txt	Adds auto-budget prompt template (example included).
results/260129-kdd-expt-2-full-fix-error/eval_gim.sh	Same offline runner, placed in fix-error directory.
results/260129-kdd-expt-2-full-fix-error/eval_api.sh	Same API runner, placed in fix-error directory.
results/260129-kdd-expt-2-full-fix-error/auto_budget_prompt.txt	Same auto-budget prompt template, placed in fix-error directory.

Copilot · 2026-02-10T10:59:35Z

results/260202-kdd-expt-2-full/*eval_gim.sh

+        echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
+        export AUTO_BUDGET_PROMPT=""


The script proceeds with --auto_budget_prompt "$AUTO_BUDGET_PROMPT" even when auto_budget_prompt.txt is missing, which will pass an empty prompt and likely change/degenerate auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory like the other result folders do.

Suggested change

-        echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
-        export AUTO_BUDGET_PROMPT=""
+        echo "Error: auto_budget_prompt.txt not found in $script_dir" >&2
+        exit 1
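
The fail-fast behavior described in this comment could be factored into a small loader. This is a minimal sketch, not the PR's actual setup code: `load_auto_budget_prompt` is a hypothetical helper name, and treating an empty file the same as a missing one is an added assumption.

```shell
#!/bin/bash
# Sketch only: load_auto_budget_prompt is a hypothetical helper, not code from the PR.

# Read auto_budget_prompt.txt from the given directory into AUTO_BUDGET_PROMPT.
# Returns non-zero (instead of exporting an empty prompt) when the file is
# missing or empty, so the caller can abort before launching any run.
load_auto_budget_prompt() {
    local dir="$1"
    local prompt_file="$dir/auto_budget_prompt.txt"

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$prompt_file" ]; then
        echo "Error: auto_budget_prompt.txt not found or empty in $dir" >&2
        return 1
    fi

    AUTO_BUDGET_PROMPT="$(cat "$prompt_file")"
    export AUTO_BUDGET_PROMPT
}
```

A runner would then call `load_auto_budget_prompt "$script_dir" || exit 1` before any command that passes `--auto_budget`.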

Copilot · 2026-02-10T10:59:35Z

results/260202-kdd-expt-2-full/*eval_gim.sh

+    python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type cfg --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
+        --auto_budget --auto_budget_prompt "$AUTO_BUDGET_PROMPT" \


The script proceeds with --auto_budget_prompt "$AUTO_BUDGET_PROMPT" even when auto_budget_prompt.txt is missing, which will pass an empty prompt and likely change/degenerate auto-budget behavior. Instead of defaulting to an empty string, fail fast (exit non-zero) when --auto_budget is used but the prompt file is absent, or vendor the prompt file into this directory like the other result folders do.

Copilot · 2026-02-10T10:59:35Z

results/260202-kdd-expt-2-full/*eval_api.sh

+set -x
+
+export API_KEY=your_api_key_here


set -x will echo expanded command lines, which includes --api_key "$API_KEY" and can leak keys into logs/CI output/shell history. Also, hardcoding export API_KEY=your_api_key_here encourages committing secrets-by-edit. Prefer requiring API_KEY to be provided from the environment (and exit with a clear message if missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if supported by gimbench).

Suggested change

-set -x
-
-export API_KEY=your_api_key_here
+# Require API_KEY to be provided via the environment; do not hardcode secrets here.
+if [ -z "${API_KEY:-}" ]; then
+    echo "Error: API_KEY environment variable is not set. Please export API_KEY before running this script." >&2
+    exit 1
+fi
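
If tracing is kept for the rest of the script, it can be switched off just around the commands that expand the key. The sketch below illustrates that pattern with a stand-in command. `fake_api_call` and `run_api_step` are hypothetical names, and the real `python -m "gimbench.mcqa.medmcqa"` invocation would take the place of the stand-in. Tracing has to be disabled inline, before the line that contains `"$API_KEY"`, because a traced call to a wrapper function would itself print the expanded arguments.

```shell
#!/bin/bash
# Sketch only: fake_api_call and run_api_step are hypothetical names.

# Stand-in for the real API runner; it only consumes its arguments.
fake_api_call() { :; }

run_api_step() {
    set -x
    echo "starting API step"            # traced: contains no secret
    { set +x; } 2>/dev/null             # stop tracing; also hides the 'set +x' line itself
    fake_api_call --api_key "$API_KEY"  # not traced, so the key is never echoed
    set -x
    echo "finished API step"            # traced again
    { set +x; } 2>/dev/null
}
```

This only addresses trace output. A key passed on the command line can still be visible in process listings while the command runs.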

Copilot · 2026-02-10T10:59:36Z

results/260202-kdd-expt-2-full/*eval_api.sh

+        python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
+            --api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500


set -x will echo expanded command lines, which includes --api_key "$API_KEY" and can leak keys into logs/CI output/shell history. Also, hardcoding export API_KEY=your_api_key_here encourages committing secrets-by-edit. Prefer requiring API_KEY to be provided from the environment (and exit with a clear message if missing), and disable tracing around the API calls (or avoid --api_key on the CLI and use a safer config/env mechanism if supported by gimbench).

Copilot · 2026-02-10T10:59:36Z

results/260202-kdd-expt-2-full/*eval_gim.sh

+
+shutdown -h +3


Unconditionally scheduling a host shutdown from a repo script is risky (surprising side effects for local runs, shared machines, CI, or remote sessions). Make shutdown opt-in (e.g., behind an environment flag), or remove it and document the intended operational workflow separately.

Suggested change

-shutdown -h +3
+# Optional host shutdown: enable by setting ALLOW_HOST_SHUTDOWN=1 in the environment.
+if [ "${ALLOW_HOST_SHUTDOWN:-0}" = "1" ]; then
+    echo "ALLOW_HOST_SHUTDOWN=1 set; scheduling host shutdown in 3 minutes..."
+    shutdown -h +3
+else
+    echo "Skipping host shutdown. To enable, set ALLOW_HOST_SHUTDOWN=1 in the environment before running this script." >&2
+fi

Copilot · 2026-02-10T10:59:36Z

results/260202-kdd-expt-2-full/*eval_api.sh

+setup_prompt
+run_api_experiments
+
+shutdown -h +3


Same concern as the offline runner: an unconditional shutdown is a hazardous default in a checked-in script. Gate it behind an explicit flag or drop it entirely to prevent accidental machine shutdowns.

Suggested change

-shutdown -h +3
+if [ "${ENABLE_EXPERIMENT_SHUTDOWN:-0}" = "1" ]; then
+    shutdown -h +3
+fi
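
The two suggestions use different flag names (`ALLOW_HOST_SHUTDOWN` for the offline runner, `ENABLE_EXPERIMENT_SHUTDOWN` here). One way to keep them consistent is a single helper that both scripts call. This is a sketch under assumptions: `maybe_shutdown` and the `SHUTDOWN_CMD` override are hypothetical names, and the override exists only so the helper can be exercised without powering off a machine.

```shell
#!/bin/bash
# Sketch only: maybe_shutdown and SHUTDOWN_CMD are hypothetical names.

# Schedule a host shutdown in 3 minutes only when ALLOW_HOST_SHUTDOWN=1.
# SHUTDOWN_CMD may be set (e.g. to "echo") for a dry run; it defaults to
# the real shutdown command.
maybe_shutdown() {
    if [ "${ALLOW_HOST_SHUTDOWN:-0}" = "1" ]; then
        echo "ALLOW_HOST_SHUTDOWN=1 set; scheduling host shutdown in 3 minutes..." >&2
        ${SHUTDOWN_CMD:-shutdown} -h +3
    else
        echo "Skipping host shutdown (set ALLOW_HOST_SHUTDOWN=1 to enable)." >&2
    fi
}
```

With the default environment, `maybe_shutdown` only prints the skip message, so running the script locally or in CI does not power off the machine.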

Copilot · 2026-02-10T10:59:36Z

results/260202-kdd-expt-2-full/*eval_api.sh

+        python -m "gimbench.mcqa.medmcqa" --model_type openai --model_name "$model" \
+            --api_key "$API_KEY" --base_url "$API_BASE" --no_gimkit --num_proc 40 --first_n 500
+        python -m "gimbench.mcqa.medmcqa" --use_gim_prompt --output_type json --model_type openai \
+            --model_name "$model" --api_key "$API_KEY" --base_url "$API_BASE" \


The baseline run omits --output_type while the GIM run forces --output_type json. If downstream analysis expects a consistent artifact format, this will produce incompatible outputs between baseline vs GIM runs. Consider explicitly setting the same --output_type for both baseline and prompted runs (or document why the formats intentionally differ).
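
One way to keep the two runs aligned is to build the shared arguments once. The sketch below uses only flags that appear in the hunk above. `OUTPUT_TYPE`, `build_run_args`, `BASELINE_ARGS`, and `GIM_ARGS` are hypothetical names. Whether the baseline (`--no_gimkit`) run accepts `--output_type json` is an assumption that would need checking against gimbench. Any further arguments of the GIM run, which the hunk truncates, are left out.

```shell
#!/bin/bash
# Sketch only: OUTPUT_TYPE, build_run_args, BASELINE_ARGS and GIM_ARGS are
# hypothetical names; the flags themselves are taken from the PR hunk.

OUTPUT_TYPE="${OUTPUT_TYPE:-json}"

# Populate BASELINE_ARGS and GIM_ARGS for one model so that both runs
# share the same --output_type and connection settings.
build_run_args() {
    local model="$1"
    local common=(
        --output_type "$OUTPUT_TYPE"
        --model_type openai
        --model_name "$model"
        --api_key "$API_KEY"
        --base_url "$API_BASE"
    )
    BASELINE_ARGS=("${common[@]}" --no_gimkit --num_proc 40 --first_n 500)
    GIM_ARGS=("${common[@]}" --use_gim_prompt)
}
```

Each array would then be expanded into a call such as `python -m "gimbench.mcqa.medmcqa" "${BASELINE_ARGS[@]}"`, so changing `OUTPUT_TYPE` in one place changes both runs.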

Copilot · 2026-02-10T10:59:37Z

results/260202-kdd-expt-2-full/*eval_gim.sh

+
+
+run_gim_experiments() {
+    python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \


Similar to the API script, the baseline run doesn’t specify --output_type while the GIM runs set --output_type cfg. This can make result aggregation brittle if tools assume a single format. Recommend setting --output_type cfg on the baseline as well (or keeping both in the same format used by your evaluation pipeline).

Suggested change

-    python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
+    python -m "gimbench.mcqa.medmcqa" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" --output_type cfg \

Copilot · 2026-02-10T10:59:37Z

results/260202-kdd-expt-2-full/*eval_gim.sh

@@ -0,0 +1,54 @@
+#!/bin/bash


The filename shown in the PR uses a literal * (*eval_gim.sh). Asterisk characters in filenames are error-prone because they interact badly with shell globbing and are not portable across all environments/tools. Consider renaming these files to a concrete name (e.g., eval_gim.sh) and, if you need a glob, keep it in documentation or a wrapper script rather than in the tracked filename.
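
If the tracked names really do contain a literal `*`, the rename could be scripted. This is a minimal sketch: `strip_star_from_names` is a hypothetical helper, it uses plain `mv` (a repository would more likely use `git mv`), and it does not guard against a name collision with an existing file.

```shell
#!/bin/bash
# Sketch only: strip_star_from_names is a hypothetical helper.

# Rename every entry in a directory whose name contains a literal '*'
# by removing the asterisks (e.g. '*eval_gim.sh' -> 'eval_gim.sh').
strip_star_from_names() {
    local dir="$1" path name
    # In a glob pattern, \* matches a literal asterisk; the surrounding
    # unescaped * match any other characters.
    for path in "$dir"/*\**; do
        [ -e "$path" ] || continue      # pattern matched nothing
        name="$(basename -- "$path")"
        mv -- "$path" "$dir/${name//\*/}"
    done
}
```

Quoting matters on the calling side as well: an unquoted `*eval_gim.sh` on a command line is expanded by the shell as a glob before the command runs, and that expansion is the fragility the comment warns about.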

Duguce and others added 2 commits February 2, 2026 16:06

chore: backup results

f1d69d9

[pre-commit.ci] auto fixes from pre-commit.com hooks

96a8702

for more information, see https://pre-commit.ci

Ki-Seki reviewed Feb 2, 2026

results/260202-kdd-expt-2-full/*eval_gim.sh (outdated)

Ki-Seki reviewed Feb 2, 2026

results/260202-kdd-expt-2-full/*eval_gim.sh (outdated)

chore: remove gim prompt about gim models

7477757

Ki-Seki added the do not merge label Feb 2, 2026

Duguce and others added 9 commits February 2, 2026 22:02

chore: update api model eval scripts

98946b6

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8d0f4f

chore: fix api model eval scripts

83d0876

chore: fix api model eval scripts

d48fe4d

chore: ignore eval.log.* files (#76)

442ddd2

feat: add upper bound to auto reason_budget (#78)

a418efd

chore: backup

ffd772c

[pre-commit.ci] auto fixes from pre-commit.com hooks

c63c61d

Merge branch 'main' into feat/eval-cot

6c1d37a

Ki-Seki changed the base branch from main to feat/eval-results February 10, 2026 10:53

Merge branch 'feat/eval-results' into feat/eval-cot

d3d9346

Ki-Seki marked this pull request as ready for review February 10, 2026 10:56

Copilot AI review requested due to automatic review settings February 10, 2026 10:56

Ki-Seki merged commit 03999d3 into feat/eval-results Feb 10, 2026
1 of 2 checks passed

Ki-Seki deleted the feat/eval-cot branch February 10, 2026 10:57

Copilot AI reviewed Feb 10, 2026

Ki-Seki merged 13 commits into feat/eval-results from feat/eval-cot

2 participants
