English | 简体中文
AACR-Bench is the industry's first multilingual code review evaluation dataset with repository-level context awareness, designed to assess how well large language models perform on automated code review tasks. The dataset comprises 200 real Pull Requests from 50 active open-source projects, covering 10 mainstream programming languages. Each instance includes not only the code changes but also the complete repository context, faithfully reproducing the full code review process. Human-LLM collaborative review combined with multiple rounds of expert annotation ensures the data is high-quality and comprehensive.

Covers 10 mainstream programming languages used across the projects:
- System-level languages: C++, Rust, Go
- Enterprise languages: Java, C#, TypeScript
- Scripting languages: Python, JavaScript, Ruby, PHP
- Preserves complete project structure
- Supports cross-file references and inter-module interaction analysis
- Includes PR metadata (description, title, comments, etc.)
- 80+ senior software engineers with 2+ years of experience
- Covers frontend, backend, architecture, and other domains
- Three rounds of cross-validation
- Systematic issue discovery
- Comprehensive defect identification
- Improvement suggestion generation
Quality Assurance Process: GitHub human comments → LLM enhancement → Expert multi-round cross-annotation → Consistency validation
AACR-Bench provides systematic evaluation capabilities across four core dimensions, supporting diverse research and application scenarios:
| 1️⃣ Multi-language Evaluation (10 Languages) | 2️⃣ Positioning Accuracy Evaluation (Line-level) |
|---|---|
| • Cross-language performance comparison: Identify model strengths and weaknesses across languages<br>• Language-specific optimization: Improve model capabilities for specific languages<br>• Generalization assessment: Test model's language transfer effectiveness | • Precise positioning: Assess single-line/multi-line issue location accuracy<br>• Cross-file tracking: Test cross-file reference identification capability<br>• Context boundaries: Verify issue scope judgment accuracy |
| 3️⃣ Issue Classification Evaluation (4 Categories) | 4️⃣ Context Understanding Evaluation (3 Levels) |
| • Classification accuracy: Assess issue type identification capability<br>• Severity assessment: Test issue priority judgment<br>• Specialized capabilities: Analyze detection rates for specific issue types | • Diff-level understanding: Basic code change analysis<br>• File-level understanding: Complete file logic comprehension<br>• Repo-level understanding: Project-wide dependency analysis |
- Performance benchmarking: Evaluate new models' code review capabilities using unified standards
- Weakness analysis: Discover model shortcomings through fine-grained metrics (e.g., low recall in certain languages)
- Iterative optimization validation: Quantify improvement effects, guide continuous model optimization
- Comparative studies: Fair comparison of different model architectures/training methods
- Ablation experiments: Analyze the impact of different context levels on review quality
- New method validation: Provide standardized evaluation environment for innovative algorithms
- Model selection: Help teams choose review models suitable for project characteristics
- Pre-deployment validation: Ensure model reliability in production environments
- Continuous monitoring: Track model performance changes in actual use
```bash
git clone https://github.com/alibaba/aacr-bench.git
cd aacr-bench
pip install -r requirements.txt
```

Edit `configs/config.json` and set the Claude CLI installation path:

```json
{
"cli_path": "your_path_to_claude.cmd",
"data_path": "your_path_to_positive_samples.json"
}
```

Create a `.env` file in the `evaluator_runner/utils/` directory:

```
LLM_MODEL_URL="your_llm_model_url"
LLM_MODEL="your_llm_model"
LLM_API_KEY="your_llm_api_key"
EMBEDDING_MODEL_URL="your_embedding_model_url"
EMBEDDING_MODEL="your_embedding_model"
EMBEDDING_API_KEY="your_embedding_api_key"
```
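For reference, these settings are read from the environment at runtime; the sketch below is purely hypothetical (the actual loading code lives inside `evaluator_runner/utils/` and may differ, and it assumes the `python-dotenv` package is available):

```python
# Hypothetical illustration only: how the evaluator could pick up the .env
# settings above. The real loading logic in evaluator_runner/utils/ may differ.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv("evaluator_runner/utils/.env")  # read the file created above

llm_settings = {
    "url": os.environ["LLM_MODEL_URL"],
    "model": os.environ["LLM_MODEL"],
    "api_key": os.environ["LLM_API_KEY"],
}
embedding_settings = {
    "url": os.environ["EMBEDDING_MODEL_URL"],
    "model": os.environ["EMBEDDING_MODEL"],
    "api_key": os.environ["EMBEDDING_API_KEY"],
}
```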
For the first run, you need to convert the raw data to task format. Uncomment and run the following in `main.py`:

```python
if __name__ == "__main__":
    load_data_as_task()  # First run: generate the task file
```

This will:
- Read the raw dataset (specified by `data_path`)
- Add a `finish` flag to each PR for progress tracking
- Generate the `tmp_data.json` task file
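Conceptually, this step amounts to something like the following sketch (hypothetical; the real `load_data_as_task()` in `main.py` may differ in details such as file names and flag handling):

```python
# Hypothetical sketch of the task-file conversion; the actual
# load_data_as_task() in main.py may differ.
import json

def load_data_as_task(data_path="positive_samples.json",
                      task_path="tmp_data.json"):
    with open(data_path, "r", encoding="utf-8") as f:
        prs = json.load(f)              # raw dataset: a list of PR records
    for pr in prs:
        pr.setdefault("finish", False)  # progress flag for each PR
    with open(task_path, "w", encoding="utf-8") as f:
        json.dump(prs, f, ensure_ascii=False, indent=2)
```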
```bash
cd claude-code-demo
python main.py
```

To evaluate generated comments programmatically:

```python
import asyncio
from evaluator_runner import (
get_evaluator_ans_from_json,
load_generated_comments_from_file,
EvaluatorConfig
)
async def main():
# Load AI-generated comments to evaluate
generated_comments = load_generated_comments_from_file("path/to/comments.txt")
# Reference comments (load from positive_samples.json)
reference_comments = [...]
# Run evaluation
result = await get_evaluator_ans_from_json(
github_pr_url="https://github.com/owner/repo/pull/123",
generated_comments=generated_comments,
good_comments=reference_comments
)
print(f"Location Match Rate: {result['positive_line_match_rate']}")
print(f"Semantic Match Rate: {result['positive_match_rate']}")
asyncio.run(main())
```

Configure parameters in `evaluator_runner/example_test.py`, then run:

```bash
python evaluator_runner/example_test.py
```

```json
{
"type": "array",
"item": {
"change_line_count": {"type": "integer", "description": "Number of changed lines"},
"project_main_language": {"type": "string", "description": "Project main language"},
"source_commit": {"type": "string", "description": "Source commit"},
"target_commit": {"type": "string", "description": "Target commit"},
"githubPrUrl": {"type": "string", "description": "GitHub PR URL"},
"comments": {
"is_ai_comment": {"type": "boolean", "description": "Whether it's an AI comment"},
"note": {"type": "string", "description": "Comment content"},
"path": {"type": "string", "description": "File path"},
"side": {"type": "string", "description": "Comment anchor position"},
"source_model": {"type": "string", "description": "Source model"},
"from_line": {"type": "integer", "description": "Start line number"},
"to_line": {"type": "integer", "description": "End line number"},
"category": {"type": "string", "description": "Issue category: Security/Defect/Maintainability/Performance"},
"context": {"type": "string", "description": "Comment scope: diff/file/repo level"}
}
}
}
```
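For illustration, a dataset file in this format containing one PR might look like the following (all values are invented, not drawn from the real dataset):

```json
[
  {
    "change_line_count": 42,
    "project_main_language": "Python",
    "source_commit": "abc1234def",
    "target_commit": "5678901fed",
    "githubPrUrl": "https://github.com/owner/repo/pull/123",
    "comments": [
      {
        "is_ai_comment": false,
        "note": "The file handle opened here is never closed on the error path; consider a context manager.",
        "path": "src/loader.py",
        "side": "RIGHT",
        "source_model": "",
        "from_line": 120,
        "to_line": 128,
        "category": "Defect",
        "context": "file"
      }
    ]
  }
]
```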
We employ a multidimensional metric system to comprehensively evaluate code review model performance. For complete metric definitions, calculation methods, and language-specific statistics, please refer to metrics.md.

| Metric | Description | Formula |
|---|---|---|
| Precision | Proportion of valid comments generated | Valid matches / Total generated |
| Recall | Ability to discover annotated issues | Valid matches / Dataset valid count |
| Line Precision | Ability to precisely locate code lines | Line matches / Total generated |
| Noise Rate | Proportion of invalid or incorrect comments | Unmatched / Total generated |
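To make the formulas concrete, here is a small sketch of how the four metrics combine (the counts are invented; the evaluator reports the actual values itself):

```python
# Illustrative computation of the four metrics from raw match counts.
def review_metrics(valid_matches: int, line_matches: int,
                   total_generated: int, dataset_valid: int) -> dict:
    return {
        "precision": valid_matches / total_generated if total_generated else 0.0,
        "recall": valid_matches / dataset_valid if dataset_valid else 0.0,
        "line_precision": line_matches / total_generated if total_generated else 0.0,
        "noise_rate": (total_generated - valid_matches) / total_generated
                      if total_generated else 0.0,
    }

# Example: 12 of 20 generated comments match annotated issues, 9 of them on the
# exact lines, against 30 annotated issues in the dataset.
print(review_metrics(valid_matches=12, line_matches=9,
                     total_generated=20, dataset_valid=30))
# {'precision': 0.6, 'recall': 0.4, 'line_precision': 0.45, 'noise_rate': 0.4}
```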
We welcome community contributions! If you want to contribute to AACR-Bench, please follow these steps:
- Fork this repository
- Create feature branch (`git checkout -b feat/add-new-prs`)
- Commit changes (`git commit -m 'feat: add new PRs'`)
- Push to branch (`git push origin feat/add-new-prs`)
- Create Pull Request
For detailed contribution guidelines, please refer to CONTRIBUTING.md
| Name | GitHub | Domain | Responsibilities |
|---|---|---|---|
| Zhengfeng Li | @lizhengfeng | Project Lead | Overall architecture design, technical direction |
| Boge Wang | @wangboge | Technical Consultant | Technical solution review, technical guidance |
| Lei Zhang | @zhanglei | Data Architecture | Evaluation framework, metric system, performance optimization |
| Yongda Yu | @yuyongda | Evaluation System | Data schema design, evaluation protocol, quality standards |
| Xinxin Guo | @guoxinxin | Annotation Platform | Annotation system development, workflow design, quality assurance |
| Minghui Yu | @yuminghui | AI Enhancement | LLM annotation pipeline, model selection, prompt optimization |
| Zhengqi Zhuang | @zhuangzhengqi | Engineering | CI/CD pipeline, automated testing, deployment scripts |
This project is licensed under the Apache License 2.0. For details, please see the LICENSE file.
If you use AACR-Bench in your research, please cite our paper:
```bibtex
@misc{zhang2026aacrbenchevaluatingautomaticcode,
title={AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context},
author={Lei Zhang and Yongda Yu and Minghui Yu and Xinxin Guo and Zhengqi Zhuang and Guoping Rong and Dong Shao and Haifeng Shen and Hongyu Kuang and Zhengfeng Li and Boge Wang and Guoan Zhang and Bangyu Xiang and Xiaobin Xu},
year={2026},
eprint={2601.19494},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2601.19494},
}
```

- v1.0 (2026.01): Initial release - 200 PRs, 10 languages
- Thanks to all contributors who participated in data annotation, especially core contributors who completed 15+ valid annotations. Full list in CONTRIBUTORS.md.
- Thanks to open-source project maintainers for providing original PR data.



