Add benchmark results for deepseek/deepseek-chat-v3.1 #335
Merged
biobootloader merged 5 commits into main on Aug 22, 2025
Conversation
- Ran 200 benchmark cases with concurrency 20
- Achieved 53 successful results (26.5% success rate)
- Total cost: $3.39
- Results saved in locodiff-250425/results/ directory structure

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/8e4e5d2f-4e96-4e73-9380-c564a1816210

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
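For context on the "concurrency 20" setting, here is a minimal sketch of bounded-concurrency case execution using an asyncio semaphore; `run_case` and the case list are hypothetical stand-ins, not the benchmark's actual API.

```python
import asyncio

CONCURRENCY = 20  # matches the concurrency used for this run

async def run_case(case_id: str) -> dict:
    # Hypothetical stand-in for one benchmark case (prompt the model,
    # compare output against the expected file, record cost).
    await asyncio.sleep(0)
    return {"case": case_id, "success": True}

async def run_all(case_ids: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(case_id: str) -> dict:
        async with sem:  # at most CONCURRENCY cases in flight at once
            return await run_case(case_id)

    return await asyncio.gather(*(bounded(c) for c in case_ids))

# results = asyncio.run(run_all([f"case_{i}" for i in range(200)]))
```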
- Reran benchmark to handle API errors from initial run
- Successfully recovered 3/4 cases that had API errors
- Added 1 new successful case, 2 new failed cases
- 1 case still has persistent API error (JSON decode issue)
- Final status: 199/200 completed (99.5%), 54 successful (27.1% success rate)
- Total cost: $3.53

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/609ae569-6826-4e80-a193-64fa2909f774

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
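A minimal sketch of the retry pattern this rerun implies, distinguishing transient API errors (such as the JSON decode failure noted above) from legitimate output mismatches; `call_model` is a hypothetical placeholder for the real client.

```python
import json
import time

def call_model_with_retry(call_model, prompt: str, retries: int = 3):
    """Retry transient API/JSON errors; let persistent failures surface."""
    for attempt in range(retries):
        try:
            raw = call_model(prompt)   # hypothetical API call
            return json.loads(raw)     # the failure seen in this run was a JSON decode error
        except (json.JSONDecodeError, ConnectionError):
            if attempt == retries - 1:
                raise                  # persistent error: report it, don't mask it
            time.sleep(2 ** attempt)   # simple exponential backoff between retries
```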
- Successfully completed the last remaining benchmark case on second retry
- qdrant_lib_segment_tests_integration_payload_index_test.rs now shows legitimate failure (output mismatch)
- Final status: 200/200 completed (100%), 54 successful (27% success rate)
- Total cost: $3.58
- All API errors resolved

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/e70dbbff-c65b-48b7-a0ae-e5f8a2f41ae1

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Combined uv installation and setup into a single step
- Export PATH in each step to ensure uv is available
- This should resolve the "uv: command not found" error in CI

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/0fcc53d6-6612-44d9-83d5-da76d5548d04

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
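As an illustration of the fix described above, a sketch of a GitHub Actions workflow excerpt; the step names, install path, and benchmark command are assumptions — only the per-step PATH export is what the commit message states.

```yaml
# Hypothetical workflow excerpt; step names and commands are illustrative.
- name: Install uv
  run: |
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"   # make uv visible within this step
    uv --version

- name: Run benchmark tooling
  run: |
    export PATH="$HOME/.local/bin:$PATH"   # each step gets a fresh shell, so re-export
    uv run python run_benchmark.py          # hypothetical entry point
```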
- Added "deepseek/deepseek-chat-v3.1": "DeepSeek Chat v3.1" to benchmark_config.yaml - Generated complete visualization pages for all 28 models including DeepSeek Chat v3.1 - Updated docs/ directory with latest benchmark results and visualizations - All 200 case pages generated successfully for the new model Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/fd1ba057-8078-4575-ba6a-07dbe60a9eaf Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
This PR adds comprehensive benchmark results for the deepseek/deepseek-chat-v3.1 model against the LoCoDiff-250425 benchmark suite.

Benchmark Summary

- Model: deepseek/deepseek-chat-v3.1
- Cases completed: 200/200 (run with concurrency 20)
- Successful: 54 (27% success rate after retrying transient API errors)
- Total cost: $3.58

Results Structure
The benchmark results are organized in the standard locodiff-250425/results/ directory structure.
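As an illustration of that layout, a sketch follows; the per-model directory name and per-case file names are assumptions, not confirmed by this PR.

```
locodiff-250425/results/
└── deepseek_deepseek-chat-v3.1/   # per-model directory (name format assumed)
    ├── <case_name>/               # one directory per case, 200 in total
    │   ├── prompt.txt             # hypothetical file names
    │   └── result.json
    └── ...
```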
Performance Analysis
The model achieved a 26.5% success rate (53/200) on the initial run at a total cost of $3.39, or roughly $0.017 per test case; after the reruns that resolved transient API errors, the final tally was 54/200 successful (27%) at a total cost of $3.58. The benchmark is a challenging code-reconstruction task covering various programming languages and repositories, including React, Ghostty, Qdrant, Tldraw, and Aider.
These results can be used for model comparison and analysis using the visualization tools in the benchmark pipeline.
🤖 This PR was created with Mentat. See my steps and cost here ✨