# Add benchmark results for anthropic/claude-4.5-sonnet #339

**Merged**: biobootloader merged 2 commits into main on Sep 29, 2025
Conversation
Completed comprehensive benchmark run for anthropic/claude-4.5-sonnet on the locodiff-250425 benchmark set.

## Results Summary

- **Total cases**: 200/200 (100% attempted)
- **✅ Successful**: 157/200 (78.5%)
- **❌ Failed**: 43/200 (21.5%, all output mismatches)
- **⚠️ API Errors**: 0/200 (0%)
- **💰 Total cost**: $47.65

## Benchmark Details

The benchmark was run with:

- Concurrency: 20
- Benchmark directory: locodiff-250425
- All 200 test cases from the benchmark set

The model achieved a 78.5% success rate, with all failures being output mismatches (no API errors or empty outputs in the final results).

## Notes

This benchmark run required 3 iterations:

1. Initial run: 142 successful, 17 API errors (model availability issue)
2. Second run: 14 successful, 1 API error (transient JSONDecodeError)
3. Third run: 1 successful, 0 API errors

All API errors were successfully resolved through retries.

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/3e467591-0742-472f-ae6d-47b9f356add2

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
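The runner itself isn't shown in this PR; as a rough sketch, executing 200 cases with a concurrency limit of 20 could look like the following (all names here are hypothetical, not the repository's actual code):

```python
import asyncio

CONCURRENCY = 20  # matches the run configuration above

async def run_case(case_id: str) -> dict:
    """Hypothetical stand-in for one benchmark case (model call + output check)."""
    await asyncio.sleep(0)  # placeholder for the real API request
    return {"case": case_id, "status": "success"}

async def run_benchmark(case_ids: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)  # cap in-flight cases at 20

    async def guarded(case_id: str) -> dict:
        async with sem:
            return await run_case(case_id)

    # gather preserves input order, so results line up with case_ids
    return await asyncio.gather(*(guarded(c) for c in case_ids))

results = asyncio.run(run_benchmark([f"case-{i:03d}" for i in range(200)]))
print(sum(r["status"] == "success" for r in results), "cases completed")
```

A semaphore keeps at most 20 requests in flight while still letting the event loop interleave all 200 cases.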
- Added display name for anthropic/claude-4.5-sonnet in benchmark_config.yaml
- Generated visualization pages for all models, including the new Sonnet 4.5 results
- Updated docs/index.html with latest benchmark data
- Created 200 case pages for Sonnet 4.5 model
- Updated chart data and styling

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/92564c0c-4e97-4d07-81a4-4113ef5127da

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
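The visualization step aggregates per-case outcomes into the headline numbers shown on the index page. A toy sketch of that aggregation, assuming a simple per-case status layout (again hypothetical, not the repository's actual code):

```python
def summarize(results: list[dict]) -> dict:
    """Roll per-case outcomes up into the headline numbers (e.g. 157/200 = 78.5%)."""
    total = len(results)
    succeeded = sum(1 for r in results if r["status"] == "success")
    return {
        "total": total,
        "successful": succeeded,
        "failed": total - succeeded,
        "success_rate": round(100 * succeeded / total, 1),
    }

# Mirror the run above: 157 successes, 43 output mismatches
sample = [{"status": "success"}] * 157 + [{"status": "output_mismatch"}] * 43
print(summarize(sample))  # success_rate comes out to 78.5
```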
This PR adds comprehensive benchmark results for anthropic/claude-4.5-sonnet on the locodiff-250425 benchmark set.

## Performance Analysis

The model achieves a 78.5% success rate on this benchmark, with all failures being output mismatches. No API errors or empty outputs remain in the final results.

## Execution Notes

This benchmark run required 3 iterations to complete. All API errors were successfully resolved through retries, demonstrating the robustness of the retry mechanism.
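The retry flow described above, re-running only the cases that hit API errors until none remain, can be sketched as follows (function and status names are hypothetical):

```python
def run_with_retries(case_ids, run_case, max_rounds=3):
    """Re-run only cases that hit transient API errors, up to max_rounds passes."""
    results = {}
    pending = list(case_ids)
    for _ in range(max_rounds):
        if not pending:
            break
        errored = []
        for case_id in pending:
            outcome = run_case(case_id)  # e.g. "success", "output_mismatch", "api_error"
            if outcome == "api_error":
                errored.append(case_id)  # queue for the next round
            else:
                results[case_id] = outcome  # terminal outcome, keep it
        pending = errored
    return results, pending  # pending is non-empty only if retries were exhausted

# Toy run: case "b" fails once with a transient API error, then succeeds
attempts = {"b": 0}
def fake_run(case_id):
    if case_id == "b":
        attempts["b"] += 1
        return "api_error" if attempts["b"] == 1 else "success"
    return "success"

done, leftover = run_with_retries(["a", "b", "c"], fake_run)
print(done, leftover)  # all three cases resolve; leftover is empty
```

Only errored cases are re-submitted each round, which matches the shrinking iteration counts reported above (17 API errors, then 1, then 0).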
🤖 This PR was created with Mentat. See my steps and cost here ✨