English | 简体中文
AACR-Bench is the industry's first multilingual code review evaluation dataset with repository-level context awareness, designed to assess how well large language models perform on automated code review tasks. The dataset comprises 200 real Pull Requests from 50 active open-source projects, covering 10 mainstream programming languages. Each instance includes not only the code changes but also the complete repository context, faithfully reproducing the full code review process. Human-LLM collaborative review combined with multiple rounds of expert annotation ensures the data is high-quality and comprehensive.

Covers 10 mainstream programming languages used across the projects:
- System-level languages: C++, Rust, Go
- Enterprise languages: Java, C#, TypeScript
- Scripting languages: Python, JavaScript, Ruby, PHP
- Preserves complete project structure
- Supports cross-file references and inter-module interaction analysis
- Includes PR metadata (description, title, comments, etc.)
- 80+ senior software engineers with 2+ years of experience
- Covers frontend, backend, architecture, and other domains
- Three rounds of cross-validation
- Systematic issue discovery
- Comprehensive defect identification
- Improvement suggestion generation
Quality Assurance Process: GitHub human comments → LLM enhancement → Expert multi-round cross-annotation → Consistency validation
AACR-Bench provides systematic evaluation capabilities across four core dimensions, supporting diverse research and application scenarios:
| 1️⃣ Multi-language Evaluation (10 Languages) | 2️⃣ Positioning Accuracy Evaluation (Line-level) |
|---|---|
| • Cross-language performance comparison: Identify model strengths and weaknesses across languages<br>• Language-specific optimization: Improve model capabilities for specific languages<br>• Generalization assessment: Test model's language transfer effectiveness | • Precise positioning: Assess single-line/multi-line issue location accuracy<br>• Cross-file tracking: Test cross-file reference identification capability<br>• Context boundaries: Verify issue scope judgment accuracy |
| 3️⃣ Issue Classification Evaluation (4 Categories) | 4️⃣ Context Understanding Evaluation (3 Levels) |
| • Classification accuracy: Assess issue type identification capability<br>• Severity assessment: Test issue priority judgment<br>• Specialized capabilities: Analyze detection rates for specific issue types | • Diff-level understanding: Basic code change analysis<br>• File-level understanding: Complete file logic comprehension<br>• Repo-level understanding: Project-wide dependency analysis |
- Performance benchmarking: Evaluate new models' code review capabilities using unified standards
- Weakness analysis: Discover model shortcomings through fine-grained metrics (e.g., low recall in certain languages)
- Iterative optimization validation: Quantify improvement effects, guide continuous model optimization
- Comparative studies: Fair comparison of different model architectures/training methods
- Ablation experiments: Analyze the impact of different context levels on review quality
- New method validation: Provide standardized evaluation environment for innovative algorithms
- Model selection: Help teams choose review models suitable for project characteristics
- Pre-deployment validation: Ensure model reliability in production environments
- Continuous monitoring: Track model performance changes in actual use
```bash
git clone https://github.com/alibaba/aacr-bench.git
cd aacr-bench
pip install -r requirements.txt
```

Edit `configs/config.json` and set the Claude CLI installation path:

```json
{
"cli_path": "your_path_to_claude.cmd",
"data_path": "your_path_to_positive_samples.json"
}
```

Create a `.env` file in the `evaluator_runner/utils/` directory:

```
LLM_MODEL_URL="your_llm_model_url"
LLM_MODEL="your_llm_model"
LLM_API_KEY="your_llm_api_key"
EMBEDDING_MODEL_URL="your_embedding_model_url"
EMBEDDING_MODEL="your_embedding_model"
EMBEDDING_API_KEY="your_embedding_api_key"
```
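For reference, these settings are read from the environment at runtime; the sketch below is purely hypothetical (the actual loading code lives inside `evaluator_runner/utils/` and may differ, and it assumes the `python-dotenv` package is available):

```python
# Hypothetical illustration only: how the evaluator could pick up the .env
# settings above. The real loading logic in evaluator_runner/utils/ may differ.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv("evaluator_runner/utils/.env")  # read the file created above

llm_settings = {
    "url": os.environ["LLM_MODEL_URL"],
    "model": os.environ["LLM_MODEL"],
    "api_key": os.environ["LLM_API_KEY"],
}
embedding_settings = {
    "url": os.environ["EMBEDDING_MODEL_URL"],
    "model": os.environ["EMBEDDING_MODEL"],
    "api_key": os.environ["EMBEDDING_API_KEY"],
}
```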
For the first run, you need to convert the raw data to task format. Uncomment and run the following in `main.py`:

```python
if __name__ == "__main__":
    load_data_as_task()  # First run: generate the task file
```

This will:
- Read the raw dataset (specified by `data_path`)
- Add a `finish` flag to each PR for progress tracking
- Generate the `tmp_data.json` task file
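Conceptually, this step amounts to something like the following sketch (hypothetical; the real `load_data_as_task()` in `main.py` may differ in details such as file names and flag handling):

```python
# Hypothetical sketch of the task-file conversion; the actual
# load_data_as_task() in main.py may differ.
import json

def load_data_as_task(data_path="positive_samples.json",
                      task_path="tmp_data.json"):
    with open(data_path, "r", encoding="utf-8") as f:
        prs = json.load(f)              # raw dataset: a list of PR records
    for pr in prs:
        pr.setdefault("finish", False)  # progress flag for each PR
    with open(task_path, "w", encoding="utf-8") as f:
        json.dump(prs, f, ensure_ascii=False, indent=2)
```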
```bash
cd claude-code-demo
python main.py
```

To evaluate generated comments programmatically:

```python
import asyncio
from evaluator_runner import (
get_evaluator_ans_from_json,
load_generated_comments_from_file,
EvaluatorConfig
)
async def main():
# Load AI-generated comments to evaluate
generated_comments = load_generated_comments_from_file("path/to/comments.txt")
# Reference comments (load from positive_samples.json)
reference_comments = [...]
# Run evaluation
result = await get_evaluator_ans_from_json(
github_pr_url="https://github.com/owner/repo/pull/123",
generated_comments=generated_comments,
good_comments=reference_comments
)
print(f"Location Match Rate: {result['positive_line_match_rate']}")
print(f"Semantic Match Rate: {result['positive_match_rate']}")
asyncio.run(main())
```

Configure parameters in `evaluator_runner/example_test.py`, then run:

```bash
python evaluator_runner/example_test.py
```

```json
{
"type": "array",
"item": {
"change_line_count": {"type": "integer", "description": "Number of changed lines"},
"project_main_language": {"type": "string", "description": "Project main language"},
"source_commit": {"type": "string", "description": "Source commit"},
"target_commit": {"type": "string", "description": "Target commit"},
"githubPrUrl": {"type": "string", "description": "GitHub PR URL"},
"comments": {
"is_ai_comment": {"type": "boolean", "description": "Whether it's an AI comment"},
"note": {"type": "string", "description": "Comment content"},
"path": {"type": "string", "description": "File path"},
"side": {"type": "string", "description": "Comment anchor position"},
"source_model": {"type": "string", "description": "Source model"},
"from_line": {"type": "integer", "description": "Start line number"},
"to_line": {"type": "integer", "description": "End line number"},
"category": {"type": "string", "description": "Issue category: Security/Defect/Maintainability/Performance"},
"context": {"type": "string", "description": "Comment scope: diff/file/repo level"}
}
}
}
```
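For illustration, a dataset file in this format containing one PR might look like the following (all values are invented, not drawn from the real dataset):

```json
[
  {
    "change_line_count": 42,
    "project_main_language": "Python",
    "source_commit": "abc1234def",
    "target_commit": "5678901fed",
    "githubPrUrl": "https://github.com/owner/repo/pull/123",
    "comments": [
      {
        "is_ai_comment": false,
        "note": "The file handle opened here is never closed on the error path; consider a context manager.",
        "path": "src/loader.py",
        "side": "RIGHT",
        "source_model": "",
        "from_line": 120,
        "to_line": 128,
        "category": "Defect",
        "context": "file"
      }
    ]
  }
]
```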
We employ a multidimensional metric system to comprehensively evaluate code review model performance. For complete metric definitions, calculation methods, and language-specific statistics, please refer to metrics.md.

| Metric | Description | Formula |
|---|---|---|
| Precision | Proportion of valid comments generated | Valid matches / Total generated |
| Recall | Ability to discover annotated issues | Valid matches / Dataset valid count |
| Line Precision | Ability to precisely locate code lines | Line matches / Total generated |
| Noise Rate | Proportion of invalid or incorrect comments | Unmatched / Total generated |
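To make the formulas concrete, here is a small sketch of how the four metrics combine (the counts are invented; the evaluator reports the actual values itself):

```python
# Illustrative computation of the four metrics from raw match counts.
def review_metrics(valid_matches: int, line_matches: int,
                   total_generated: int, dataset_valid: int) -> dict:
    return {
        "precision": valid_matches / total_generated if total_generated else 0.0,
        "recall": valid_matches / dataset_valid if dataset_valid else 0.0,
        "line_precision": line_matches / total_generated if total_generated else 0.0,
        "noise_rate": (total_generated - valid_matches) / total_generated
                      if total_generated else 0.0,
    }

# Example: 12 of 20 generated comments match annotated issues, 9 of them on the
# exact lines, against 30 annotated issues in the dataset.
print(review_metrics(valid_matches=12, line_matches=9,
                     total_generated=20, dataset_valid=30))
# {'precision': 0.6, 'recall': 0.4, 'line_precision': 0.45, 'noise_rate': 0.4}
```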
We welcome community contributions! If you want to contribute to AACR-Bench, please follow these steps:
- Fork this repository
- Create feature branch (`git checkout -b feat/add-new-prs`)
- Commit changes (`git commit -m 'feat: add new PRs'`)
- Push to branch (`git push origin feat/add-new-prs`)
- Create Pull Request
For detailed contribution guidelines, please refer to CONTRIBUTING.md
| Name | GitHub | Domain | Responsibilities |
|---|---|---|---|
| Zhengfeng Li | @lizhengfeng | Project Lead | Overall architecture design, technical direction |
| Boge Wang | @wangboge | Technical Consultant | Technical solution review, technical guidance |
| Lei Zhang | @zhanglei | Data Architecture | Evaluation framework, metric system, performance optimization |
| Yongda Yu | @yuyongda | Evaluation System | Data schema design, evaluation protocol, quality standards |
| Xinxin Guo | @guoxinxin | Annotation Platform | Annotation system development, workflow design, quality assurance |
| Minghui Yu | @yuminghui | AI Enhancement | LLM annotation pipeline, model selection, prompt optimization |
| Zhengqi Zhuang | @zhuangzhengqi | Engineering | CI/CD pipeline, automated testing, deployment scripts |
This project is licensed under the Apache License 2.0. For details, please see the LICENSE file.
If you use AACR-Bench in your research, please cite our paper:
```bibtex
@misc{zhang2026aacrbenchevaluatingautomaticcode,
title={AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context},
author={Lei Zhang and Yongda Yu and Minghui Yu and Xinxin Guo and Zhengqi Zhuang and Guoping Rong and Dong Shao and Haifeng Shen and Hongyu Kuang and Zhengfeng Li and Boge Wang and Guoan Zhang and Bangyu Xiang and Xiaobin Xu},
year={2026},
eprint={2601.19494},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2601.19494},
}
```

- v1.0 (2026.01): Initial release - 200 PRs, 10 languages
- Thanks to all contributors who participated in data annotation, especially core contributors who completed 15+ valid annotations. Full list in CONTRIBUTORS.md.
- Thanks to open-source project maintainers for providing original PR data.



