This doc walks you through creating a custom sys-intelligence benchmark, starting from a simple educational example. By following these steps, you'll be able to evaluate the latest models and agents on your own system-related tasks and seamlessly integrate your benchmark into the System Intelligence Benchmark.
Before creating a custom benchmark, ensure you have:
- Python 3.9 or higher
- Basic understanding of the benchmark framework
- A clear evaluation design in mind (what the tasks are and how to score them)
Choose an example benchmark that is similar to your setting as a starting point.
If your tasks only involve text-based questions, consider starting from courseexam_bench. If your benchmark focuses on algorithm design or optimization tasks, you might use cache_algo_bench as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
Use courselab_bench if your benchmark involves environment setup, system understanding/implementation, performance analysis, or debugging tasks, where each task may need a different running environment. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests, much like a human developer would. To support this, we provide a simple ReAct agent in this example, along with guidance for integrating new agents.
For more complex benchmarks, we highly recommend taking a look at open-source benchmark frameworks such as HAL or Inspect, which provide many diverse benchmarks as examples.
1. Navigate to the benchmarks directory:

   ```bash
   cd benchmarks/
   ```

2. Copy the chosen benchmark to create your new benchmark:

   ```bash
   cp -r chosen_bench/ your_bench_name/
   cd your_bench_name/
   ```
Your benchmark directory should have the following minimal structure:

```
your_bench_name/
├── src/
│   └── main.py          # Main evaluation logic
├── data/                # Test data and scenarios
│   └── benchmark/       # Benchmark datasets
├── tests/               # Unit tests
├── Dockerfile           # Docker configuration (optional)
├── env.toml             # Environment configuration
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Testing script
├── requirements.txt     # Python dependencies
└── README.md            # Benchmark documentation
```
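If you prefer to scaffold this layout from scratch instead of copying an example, the same structure can be created with a few commands (a sketch; adjust names to taste):

```shell
# Scaffold the minimal benchmark layout shown above
BENCH=your_bench_name
mkdir -p "$BENCH"/src "$BENCH"/data/benchmark "$BENCH"/tests
touch "$BENCH"/src/main.py "$BENCH"/env.toml "$BENCH"/requirements.txt "$BENCH"/README.md
touch "$BENCH"/install.sh "$BENCH"/run.sh "$BENCH"/test.sh
chmod +x "$BENCH"/install.sh "$BENCH"/run.sh "$BENCH"/test.sh
```

Copying an existing benchmark is still the easier route, since it brings working scripts along with the layout.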
Create your evaluation dataset in a structured format:
1. Create a data directory if it doesn't exist:

   ```bash
   mkdir -p data/benchmark/
   ```

2. Define your test cases in JSONL format (recommended). The following is a minimal example:

   ```json
   {"id": "task_001", "sys_prompt": "You are a helpful assistant.", "user_prompt": "Solve this problem...", "response": "Expected answer..."}
   {"id": "task_002", "sys_prompt": "You are a helpful assistant.", "user_prompt": "Another task...", "response": "Expected answer..."}
   ```

   Each line contains:
   - `id`: Unique identifier for the test case
   - `sys_prompt`: System prompt for the LLM
   - `user_prompt`: User query/task description
   - `response`: Expected/ground truth response
NOTE: for more complex scenarios, you can use any custom format. See course_exam_bench and course_lab_bench for examples.
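Typos in hand-written JSONL are a common source of confusing failures, so a quick sanity check of the dataset can save time. A minimal validator for the format above (plain Python; `validate_jsonl` is an illustrative helper, not part of the SDK):

```python
import json

REQUIRED_KEYS = {"id", "sys_prompt", "user_prompt", "response"}

def validate_jsonl(path):
    """Check that every line is valid JSON with the required keys and a unique id."""
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)  # raises ValueError on malformed JSON
            missing = REQUIRED_KEYS - item.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            if item["id"] in seen_ids:
                raise ValueError(f"line {line_no}: duplicate id {item['id']!r}")
            seen_ids.add(item["id"])
    return len(seen_ids)
```

Running `validate_jsonl("data/benchmark/tasks.jsonl")` before a full benchmark run catches malformed lines early instead of mid-evaluation.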
The sdk/ folder provides base classes for building your benchmark. You need to select or implement both an executor (to run the LLM) and an evaluator (to score the results).
Check the sdk/ folder for available components:
Executors (sdk/executor.py):
- `Executor`: Base class for executors
- `SimpleExecutor`: Basic LLM executor that extracts code from responses
Evaluators (sdk/evaluator.py):
- `Evaluator`: Base class for evaluators
- `BasicEvaluator`: Provides multiple similarity metrics (syntax correctness, exact match, Jaccard similarity, cosine similarity, embeddings similarity, LLM-as-judge)
- `ExamEvaluator`: Specialized for exam questions (single-choice, multiple-choice, short-answer)
- `LLMJudger`: Uses an LLM to judge response quality
- `LLMExamJudger`: Uses an LLM to grade exam responses
Other utilities:
- `sdk/llm.py`: LLM interface for querying language models
- `sdk/utils.py`: Utility functions, including `set_llm_endpoint_from_config()`
Edit src/main.py to implement your benchmark. Here's the structure based on example_bench:
"""Benchmark for evaluating model performance on your specific task."""
import argparse
import json
import os
import sys
from datetime import datetime
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../../')))
from sdk.utils import set_llm_endpoint_from_config
set_llm_endpoint_from_config('env.toml')
from sdk.executor import SimpleExecutor
from sdk.evaluator import BasicEvaluator
def main(_input_file, output_dir, _model_name, agent_name):
"""Main function for running the benchmark."""
total_score = []
with (
open(_input_file, encoding='utf-8') as data,
open(os.path.join(output_dir, 'result.jsonl'), 'w', encoding='utf-8') as output_file,
):
for line in data:
item = json.loads(line)
print('============ ' + item['id'] + ' ============')
# Step 1: Select or implement your executor
if agent_name == "llm":
executor = SimpleExecutor(_model_name, item['sys_prompt'])
else:
# You can add more agents/executors here
# Example: CustomExecutor, AgentExecutor, etc.
raise ValueError(f'Unknown agent name: {agent_name}')
# Step 2: Execute the task
response = executor.run(item['user_prompt'])
# Step 3: Select or implement your evaluator
evaluator = BasicEvaluator(_model_name)
offline_metrics = evaluator.eval(
question=item['user_prompt'],
answer=response,
groundtruth=item
)
# Step 4: Collect scores
total_score.append((
offline_metrics['syntax_acc'],
offline_metrics['exact_match'],
offline_metrics['jaccard_similarity'],
offline_metrics['cosine_similarity'],
offline_metrics['embeddings_similarity'],
offline_metrics['llmjudger_rating'],
))
# Step 5: Save individual result
result = {
'id': item['id'],
'sys_prompt': item['sys_prompt'],
'user_prompt': item['user_prompt'],
'groundtruth': item['response'],
'response': response,
'syntax_acc': offline_metrics['syntax_acc'],
'exact_match': offline_metrics['exact_match'],
'jaccard_similarity': offline_metrics['jaccard_similarity'],
'cosine_similarity': offline_metrics['cosine_similarity'],
'embeddings_similarity': offline_metrics['embeddings_similarity'],
'llmjudger_rating': offline_metrics['llmjudger_rating'],
'llmjudger_answer': offline_metrics['llmjudger_answer'],
}
print('Evaluation Result:')
print(result)
output_file.write(json.dumps(result))
output_file.write('\n')
# Step 6: Calculate and save average scores
avg_score = [sum(values) / len(values) for values in list(zip(*total_score))]
avg_score_dict = {
'syntax_acc': avg_score[0],
'exact_match': avg_score[1],
'jaccard_similarity': avg_score[2],
'cosine_similarity': avg_score[3],
'embeddings_similarity': avg_score[4],
'llmjudger_rating': avg_score[5],
'final_score': sum(avg_score[:5]) / 5, # It's final score for your benchmark, you should customize it
}
with open(os.path.join(output_dir, 'avg_score.json'), 'w', encoding='utf-8') as avg_score_file:
json.dump(avg_score_dict, avg_score_file, indent=4)
print('************ Final average score ************')
print(avg_score_dict)Option A: Use Existing SDK Components (Recommended)
For standard evaluations, use the provided executors and evaluators:
```python
from sdk.executor import SimpleExecutor
from sdk.evaluator import BasicEvaluator  # or ExamEvaluator

executor = SimpleExecutor(model_name, sys_prompt)
evaluator = BasicEvaluator(model_name)
```

Option B: Implement Custom Executor
For specialized execution needs (e.g., code execution, agent-based reasoning):
```python
from sdk.executor import Executor

class CustomExecutor(Executor):
    def __init__(self, model_name, sys_prompt):
        super().__init__(model_name, sys_prompt)
        # Your custom initialization

    def run(self, user_prompt, lang=''):
        # Your custom execution logic
        # Example: run code, use specialized prompting, multi-turn dialogue
        pass
```

Option C: Implement Custom Evaluator
For specialized evaluation metrics:
```python
from sdk.evaluator import Evaluator

class CustomEvaluator(Evaluator):
    def __init__(self, model_name):
        super().__init__()
        self.model_name = model_name

    def eval(self, question, answer, groundtruth):
        # Your custom evaluation logic
        # Example: code execution validation, domain-specific metrics
        return {
            'custom_metric_1': score1,
            'custom_metric_2': score2,
        }
```

- `example_bench/src/main.py`: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
- `course_exam_bench/`: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
- `cache_algo_bench/`: Uses a custom evaluator (cache_simulator) for code execution and performance testing
- `course_lab_bench/`: Uses an agent-based executor for complex project execution
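Whichever option you choose, run.sh later invokes this file as `python src/main.py -m "your-model-name" -a llm`, so `main()` needs a small argument-parsing wrapper. A minimal sketch (the long-form flag names and defaults here are assumptions; match them to the template you copied):

```python
import argparse

def parse_args(argv=None):
    """Parse the -m/-a flags that run.sh passes to src/main.py."""
    parser = argparse.ArgumentParser(description="Run the benchmark.")
    parser.add_argument("-m", "--model-name", required=True, help="model to evaluate")
    parser.add_argument("-a", "--agent-name", default="llm", help="executor/agent to use")
    parser.add_argument("-i", "--input-file", default="data/benchmark/tasks.jsonl")
    parser.add_argument("-o", "--output-dir", default="outputs")
    return parser.parse_args(argv)

# In src/main.py you would then call, for example:
#     args = parse_args()
#     os.makedirs(args.output_dir, exist_ok=True)
#     main(args.input_file, args.output_dir, args.model_name, args.agent_name)
```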
Configure your benchmark settings:
```toml
[llm]
OPENAI_API_KEY = "sk-XXXX"
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXX"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "sk-ant-XXXX"

[hardware]
use_gpu = false

[env-docker]
image = "default"  # or specify custom image
entrypoint = "run.sh"
```

Add any additional dependencies your benchmark needs beyond the SDK requirements.
Ensure all dependencies are installed:
```bash
#!/bin/bash
set -e

echo "Installing dependencies for your_bench_name..."

# You can add system-level dependencies here if needed

# Create virtual environment if needed
if [ ! -d "venv" ]; then
    python3 -m venv venv
fi

source venv/bin/activate

# Install requirements
pip install -r requirements.txt

echo "Installation complete!"
```

Configure the execution script:
```bash
#!/bin/bash
set -e

# Activate virtual environment
source venv/bin/activate

# Run the benchmark
python src/main.py -m "your-model-name" -a llm

echo "Benchmark execution complete!"
echo "Results saved to: outputs/"
```

- Scenario Description: Explain what your benchmark evaluates
- Task Details: Describe input, output, and evaluation criteria
- Setup Instructions: Docker and manual setup steps
- Example Results (optional): Show sample outputs or performance metrics
See the template in benchmarks/example_bench/README.md for structure.
Create tests in the tests/ directory:
```python
# tests/test_evaluator.py
import unittest

from src.main import main


class TestYourBenchmark(unittest.TestCase):
    def test_evaluation(self):
        """Test benchmark evaluation."""
        # Implement test
        pass


if __name__ == "__main__":
    unittest.main()
```

Run tests:

```bash
./test.sh
```
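Live LLM calls are awkward to unit-test, but pure helpers are not. For example, if you factor the column-wise averaging from Step 6 of main.py into its own function, it can be tested without any model access (`average_scores` is a hypothetical refactor, not an SDK function):

```python
def average_scores(total_score):
    """Column-wise mean of per-task metric tuples, mirroring Step 6 in main.py."""
    return [sum(values) / len(values) for values in zip(*total_score)]

# Two tasks with two metrics each average column by column:
# average_scores([(1.0, 0.0), (0.0, 1.0)]) returns [0.5, 0.5]
```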
1. Install dependencies:

   ```bash
   ./install.sh
   ```

2. Configure `env.toml` with your LLM credentials.

3. Run the benchmark:

   ```bash
   ./run.sh
   ```
1. Build the Docker image:

   ```bash
   docker build -t your_bench_name .
   ```

2. Run in Docker:

   ```bash
   docker run --rm your_bench_name
   ```
To make your benchmark available through the CLI:
1. Update `cli/run_docker.sh` or `cli/run_all_local.sh` if needed.

2. Your benchmark will now be accessible via:

   ```bash
   cd cli
   ./run_all_local.sh <model_name>  # or ./run_docker.sh <model_name>
   ```
Once your benchmark is complete and tested:
- Ensure all tests pass
- Update the main README.md to list your benchmark
- Submit a pull request following the contribution guidelines
Your contributions help make the system intelligence benchmark more comprehensive, robust, and valuable for evaluating AI systems!
- Use the SDK: Leverage the evaluator and executor base classes in `sdk/` for consistency
- Standardized Output: Follow the JSONL format for results and JSON for summaries
- Error Handling: Implement robust error handling and timeouts
- Documentation: Provide clear documentation and examples
- Reproducibility: Ensure your benchmark produces consistent results
- Code Quality: Follow the project's coding standards (Ruff, 120 char lines, Google-style docstrings). Follow the PreChecks.md for code formatting and linting guidelines.
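On the error-handling point: LLM calls and task code can hang, so wrapping each task in a timeout is worth a few lines. A standard-library sketch (`run_with_timeout` is a hypothetical helper, not part of the SDK):

```python
import concurrent.futures

def run_with_timeout(func, *args, timeout=60):
    """Run one benchmark task with a hard timeout; return (result, error_message)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func, *args)
        try:
            return future.result(timeout=timeout), None
        except concurrent.futures.TimeoutError:
            # Note: the worker thread cannot be killed, so a timed-out task still
            # runs to completion before this returns; use a subprocess for hard cancellation.
            return None, f"timed out after {timeout}s"
        except Exception as exc:
            return None, f"{type(exc).__name__}: {exc}"
```

Recording the error string in the per-task result (instead of letting one bad task crash the run) keeps result.jsonl complete and makes failures auditable.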
Refer to existing benchmarks for inspiration:
- `example_bench/`: Minimal template with `SimpleExecutor` + `BasicEvaluator`
- `cache_algo_bench/`: Code execution, algorithm simulation, and performance evaluation
- `courseexam_bench/`: Multiple-choice and short-answer questions with `ExamEvaluator`
- `courselab_bench/`: Complex project-based evaluation with agent executors
- Review the main README.md for overall project structure
- Check the `sdk/` folder for available base classes and utilities
- Check existing benchmarks for implementation patterns