Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions results/260129-kdd-expt-2-full-fix-error/auto_budget_prompt.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
I'll show you a question. Your objective is to determine the optimal number of reasoning steps required to achieve a flawless, high-confidence conclusion.

### Reasoning Standards:
- **Depth (The 200-Word Rule)**: Each step must be a high-density thinking unit (equivalent to 180–240 words of analysis).
- **Freedom of Expression**: Use **frequent line breaks**, LaTeX ($...$), and Markdown to mimic a researcher's whiteboard or professional notes.
- **Cognitive Flow**: Don't just list facts. Use parallel thinking (alternative theories), scenario simulation (what if?), and systematic elimination of distractions.
- **Natural Audit**: Every step should conclude with a 'Validation' or 'Check' that naturally flows from the preceding logic to ensure no errors propagate.

### Examples:
## Question: Why is the clear daytime sky blue instead of any other color, like violet or white?
## Response Example:
Step 1: Establishing the physical interaction between sunlight and the atmosphere.
Sunlight (white light) is composed of a broad spectrum of wavelengths. As this light enters Earth's atmosphere, it encounters gas molecules ($N_2$, $O_2$) that are significantly smaller than the wavelength of the light itself. This specific scale ratio triggers Rayleigh scattering, where the intensity of scattered light $I$ is inversely proportional to the fourth power of the wavelength:
$$I_{\text{scattered}} \propto \frac{1}{\lambda^4}$$
This means shorter wavelengths (blue and violet) are scattered much more efficiently in all directions than longer wavelengths (red and yellow).
*Validation*: I've confirmed the scale constraint; if the particles were larger (like water droplets), we would see Mie scattering (white light) instead.

Step 2: Addressing the 'Violet Paradox'.
If shorter wavelengths scatter more, a logical question arises: why isn't the sky violet? Violet light has an even shorter wavelength than blue light and should scatter more intensely. However, we must consider the Solar Spectral Irradiance and human physiology. The Sun emits significantly more 'blue' photons than 'violet' ones. Furthermore, the human eye's trichromatic vision is much more sensitive to blue than violet. Our brains essentially process the scattered mixture as a pale, saturated blue.
*Validation*: This cross-disciplinary check (Physics + Biology) explains why the observation doesn't perfectly match the raw $\lambda^{-4}$ curve.

Step 3: Systematic elimination of alternative explanations.
Could the blue color come from the ocean reflecting onto the sky? No, because the sky is blue in the middle of vast continents. Could it be O3 (Ozone) absorption? While Ozone absorbs some red light (Chappuis absorption), its effect is too weak to explain the daytime sky's brilliance. By eliminating reflection and pure absorption, we isolate scattering as the primary driver.
*Validation*: The elimination process anchors the conclusion in atmospheric scattering rather than secondary environmental factors.

Step 4: Scenario Simulation - Sunset and Atmospheric Depth.
To verify scattering, let's observe what happens when the path length of light increases at sunset. The light must travel through much more atmosphere to reach the observer. By the time it arrives, the blue light has been scattered away entirely, leaving only the unscattered longer wavelengths (reds and oranges). This 'inverse' observation perfectly validates the Rayleigh model.
*Validation*: The sunset simulation provides the 'stress test' needed to confirm the wavelength-dependency of the theory.

Step 5: Final synthesis and system-wide check.
The blue sky is the result of a precise physical-biological intersection:
1. Rayleigh scattering favors short wavelengths.
2. The Sun's output peaks in the visible range, favoring blue over violet.
3. The human eye integrates this scattered light as blue.
*Final Audit*: The logic is internally consistent across electromagnetic theory, solar physics, and human perception.
## Reasoning steps: 5
49 changes: 49 additions & 0 deletions results/260129-kdd-expt-2-full-fix-error/eval_api.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#!/bin/bash
set -x

export API_KEY=your_api_key_here
export API_BASE="https://openrouter.ai/api/v1"

API_MODELS=(
"google/gemma-3-27b-it"
"qwen/qwen3-30b-a3b-instruct-2507"
)

DATASETS=(
"qasc"
"gpqa_diamond"
)


setup_prompt() {
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [ -f "$script_dir/auto_budget_prompt.txt" ]; then
export AUTO_BUDGET_PROMPT="$(cat "$script_dir/auto_budget_prompt.txt")"
else
echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
export AUTO_BUDGET_PROMPT=""
fi
export REASON_STEP_DESC="A distinct, verified reasoning step building logically on the previous one. Each step must be a high-density analysis (180–240 words equivalent) but is fully liberated in format: use frequent line breaks, LaTeX, lists, or tables as needed. The goal is logical transparency; ensure each step achieves a sub-goal, shows its full derivation, and includes a micro-verification to prevent error propagation."
}



run_api_experiments() {
python -m "gimbench.mcqa.gpqa_diamond" --use_gim_prompt --output_type json --model_type openai \
--model_name "qwen/qwen3-30b-a3b-instruct-2507" --api_key "$API_KEY" --base_url "$API_BASE" --reason_budget 1 --num_proc 40 --first_n -1
for BUDGET in {1..7..2}; do python -m "gimbench.mcqa.gpqa_diamond" --use_gim_prompt --output_type json --model_type openai \
--model_name "google/gemma-3-27b-it" --api_key "$API_KEY" --base_url "$API_BASE" --reason_budget "$BUDGET" --num_proc 40 --first_n -1; done

python -m "gimbench.mcqa.qasc" --use_gim_prompt --output_type json --model_type openai \
--model_name "minimax/minimax-m2.1" --api_key "$API_KEY" --base_url "$API_BASE" --reason_budget 1 --num_proc 40 --first_n -1
python -m "gimbench.mcqa.gpqa_diamond" --use_gim_prompt --output_type json --model_type openai \
--model_name "minimax/minimax-m2.1" --api_key "$API_KEY" --base_url "$API_BASE" --reason_budget 1 --num_proc 40 --first_n -1

}



setup_prompt
run_api_experiments

shutdown -h +3
48 changes: 48 additions & 0 deletions results/260129-kdd-expt-2-full-fix-error/eval_gim.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
#!/bin/bash
set -x


GIM_MODELS=(
"Sculpt-AI/2601261-qwen-4b-50k"
)

DATASETS=(
"qasc"
"gpqa_diamond"
)


setup_prompt() {
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [ -f "$script_dir/auto_budget_prompt.txt" ]; then
export AUTO_BUDGET_PROMPT="$(cat "$script_dir/auto_budget_prompt.txt")"
else
echo "Warning: auto_budget_prompt.txt not found in $script_dir" >&2
export AUTO_BUDGET_PROMPT=""
fi
export REASON_STEP_DESC="A distinct, verified reasoning step building logically on the previous one. Each step must be a high-density analysis (180–240 words equivalent) but is fully liberated in format: use frequent line breaks, LaTeX, lists, or tables as needed. The goal is logical transparency; ensure each step achieves a sub-goal, shows its full derivation, and includes a micro-verification to prevent error propagation."
}


run_gim_experiments() {
for ds in "${DATASETS[@]}"; do
python -m "gimbench.mcqa.$ds" --model_type vllm-offline --model_name "Qwen/Qwen3-4B-Instruct-2507" \
--no_gimkit --num_proc 40 --first_n -1
for model in "${GIM_MODELS[@]}"; do
python -m "gimbench.mcqa.$ds" --model_type vllm-offline --model_name "$model" \
--auto_budget --auto_budget_prompt "$AUTO_BUDGET_PROMPT" \
--reason_step_desc "$REASON_STEP_DESC" --num_proc 40 --first_n -1
for BUDGET in {1..7..2}; do python -m "gimbench.mcqa.$ds" --model_type vllm-offline --model_name "$model" \
--reason_budget "$BUDGET" --num_proc 40 --first_n -1; done
done
done
}



setup_prompt
run_gim_experiments



shutdown -h +3
Loading