Add benchmark results for deepseek/deepseek-chat-v3.1 #335
Merged
biobootloader merged 5 commits into main on Aug 22, 2025
Conversation
- Ran 200 benchmark cases with concurrency 20
- Achieved 53 successful results (26.5% success rate)
- Total cost: $3.39
- Results saved in locodiff-250425/results/ directory structure

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/8e4e5d2f-4e96-4e73-9380-c564a1816210

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
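For context on the "concurrency 20" setting, here is a minimal sketch of bounded-concurrency case execution using an asyncio semaphore; `run_case` and the case list are hypothetical stand-ins, not the benchmark's actual API.

```python
import asyncio

CONCURRENCY = 20  # matches the concurrency used for this run

async def run_case(case_id: str) -> dict:
    # Hypothetical stand-in for one benchmark case (prompt the model,
    # compare output against the expected file, record cost).
    await asyncio.sleep(0)
    return {"case": case_id, "success": True}

async def run_all(case_ids: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(case_id: str) -> dict:
        async with sem:  # at most CONCURRENCY cases in flight at once
            return await run_case(case_id)

    return await asyncio.gather(*(bounded(c) for c in case_ids))

# results = asyncio.run(run_all([f"case_{i}" for i in range(200)]))
```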
- Reran benchmark to handle API errors from initial run
- Successfully recovered 3/4 cases that had API errors
- Added 1 new successful case, 2 new failed cases
- 1 case still has persistent API error (JSON decode issue)
- Final status: 199/200 completed (99.5%), 54 successful (27.1% success rate)
- Total cost: $3.53

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/609ae569-6826-4e80-a193-64fa2909f774

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
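A minimal sketch of the retry pattern this rerun implies, distinguishing transient API errors (such as the JSON decode failure noted above) from legitimate output mismatches; `call_model` is a hypothetical placeholder for the real client.

```python
import json
import time

def call_model_with_retry(call_model, prompt: str, retries: int = 3):
    """Retry transient API/JSON errors; let persistent failures surface."""
    for attempt in range(retries):
        try:
            raw = call_model(prompt)   # hypothetical API call
            return json.loads(raw)     # the failure seen in this run was a JSON decode error
        except (json.JSONDecodeError, ConnectionError):
            if attempt == retries - 1:
                raise                  # persistent error: report it, don't mask it
            time.sleep(2 ** attempt)   # simple exponential backoff between retries
```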
- Successfully completed the last remaining benchmark case on second retry
- qdrant_lib_segment_tests_integration_payload_index_test.rs now shows legitimate failure (output mismatch)
- Final status: 200/200 completed (100%), 54 successful (27% success rate)
- Total cost: $3.58
- All API errors resolved

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/e70dbbff-c65b-48b7-a0ae-e5f8a2f41ae1

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Combined uv installation and setup into a single step
- Export PATH in each step to ensure uv is available
- This should resolve the "uv: command not found" error in CI

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/0fcc53d6-6612-44d9-83d5-da76d5548d04

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
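As an illustration of the fix described above, a sketch of a GitHub Actions workflow excerpt; the step names, install path, and benchmark command are assumptions — only the per-step PATH export is what the commit message states.

```yaml
# Hypothetical workflow excerpt; step names and commands are illustrative.
- name: Install uv
  run: |
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"   # make uv visible within this step
    uv --version

- name: Run benchmark tooling
  run: |
    export PATH="$HOME/.local/bin:$PATH"   # each step gets a fresh shell, so re-export
    uv run python run_benchmark.py          # hypothetical entry point
```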
- Added "deepseek/deepseek-chat-v3.1": "DeepSeek Chat v3.1" to benchmark_config.yaml - Generated complete visualization pages for all 28 models including DeepSeek Chat v3.1 - Updated docs/ directory with latest benchmark results and visualizations - All 200 case pages generated successfully for the new model Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/fd1ba057-8078-4575-ba6a-07dbe60a9eaf Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
This PR adds comprehensive benchmark results for the deepseek/deepseek-chat-v3.1 model against the LoCoDiff-250425 benchmark suite.

Benchmark Summary

- Model: deepseek/deepseek-chat-v3.1
- Cases completed: 200/200 (run with concurrency 20)
- Successful: 54 (27% success rate after retrying transient API errors)
- Total cost: $3.58

Results Structure
The benchmark results are organized in the standard locodiff-250425/results/ directory structure.
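As an illustration of that layout, a sketch follows; the per-model directory name and per-case file names are assumptions, not confirmed by this PR.

```
locodiff-250425/results/
└── deepseek_deepseek-chat-v3.1/   # per-model directory (name format assumed)
    ├── <case_name>/               # one directory per case, 200 in total
    │   ├── prompt.txt             # hypothetical file names
    │   └── result.json
    └── ...
```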
Performance Analysis
The model achieved a 26.5% success rate (53/200) on the initial run at a total cost of $3.39, or roughly $0.017 per test case; after the reruns that resolved transient API errors, the final tally was 54/200 successful (27%) at a total cost of $3.58. The benchmark is a challenging code-reconstruction task covering various programming languages and repositories, including React, Ghostty, Qdrant, Tldraw, and Aider.
These results can be used for model comparison and analysis using the visualization tools in the benchmark pipeline.
🤖 This PR was created with Mentat. See my steps and cost here ✨