| Field | Value |
|---|---|
| version | 2.0 |
| last_updated | 2025-12-22 |
This guide covers common issues, debugging techniques, and solutions for the pytest-based evaluation system.
When something goes wrong, check:
- Exit code: What was `pytest_exit_code`? (0 = ok, 1 = failures, 2-5 = infrastructure)
- Infrastructure failure: Is `infrastructure_failure: true`?
- Tests collected: Were tests discovered? Check `pytest_collected`.
- Pytest output: Read `evaluation/stdout.txt` and `evaluation/stderr.txt`.
- CTRF report: Check `evaluation/report.json` for raw results.
| Code | Meaning | Action |
|---|---|---|
| 0 | All tests passed | Success! |
| 1 | Some tests failed | Review failure messages |
| 2 | Interrupted | Check timeout or manual cancel |
| 3 | Internal error | Pytest itself crashed - check stderr |
| 4 | Usage error | Invalid pytest options |
| 5 | No tests collected | Tests not found - check paths |
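The same table can be applied programmatically when triaging many runs. A minimal sketch, assuming a `results` object exposing the `pytest_exit_code` and `infrastructure_failure` fields named above:

```python
# Sketch only: assumes a results object with the fields named above.
def triage(results) -> str:
    if results.infrastructure_failure:
        return "infrastructure failure - read evaluation/stderr.txt"
    if results.pytest_exit_code == 0:
        return "all tests passed"
    if results.pytest_exit_code == 1:
        return "test failures - review failure messages"
    return f"pytest did not run cleanly (exit code {results.pytest_exit_code})"
```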
Infrastructure failures mean pytest itself failed to run properly.
Symptoms:
- `pytest_collected: 0`
- `infrastructure_failure: true`
- Empty or no test results

Causes and Solutions:

- Test files not copied correctly

  ```bash
  # Check test directory in workspace
  ls -la .evaluation_tests/
  ```

- Wrong checkpoint in test filename

  ```
  # Must be: test_checkpoint_1.py (not test_cp1.py)
  problems/my_problem/tests/test_checkpoint_1.py
  ```

- Import errors in test file

  ```bash
  # Check pytest output for import errors
  cat evaluation/stderr.txt
  ```

- Missing conftest.py

  ```python
  # Required in tests/conftest.py
  def pytest_addoption(parser):
      parser.addoption("--entrypoint", action="store", required=True)
      parser.addoption("--checkpoint", action="store", required=True)
  ```
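To see exactly what pytest discovers, run a collection dry run from the workspace. A sketch mirroring the manual `uvx` invocation shown later in this guide; `--collect-only` and `-q` are standard pytest flags, and the entrypoint/checkpoint values are placeholders:

```python
import subprocess

# Dry-run collection only; no tests are executed.
result = subprocess.run(
    [
        "uvx", "--with=pytest", "pytest",
        "--collect-only", "-q",
        "--entrypoint=python main.py",
        "--checkpoint=checkpoint_1",
        ".evaluation_tests/",
    ],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)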
Symptoms:
- `infrastructure_failure: true`
- Errors in stderr about syntax or imports

Causes and Solutions:

- Syntax error in test file

  ```bash
  # Check for Python syntax errors
  python -m py_compile tests/test_checkpoint_1.py
  ```

- Missing pytest dependency

  ```yaml
  # Add to problem config
  test_dependencies:
    - "some-package>=1.0"
  ```

- Fixture not defined

  ```python
  # Make sure conftest.py defines all required fixtures
  import shlex

  import pytest

  @pytest.fixture
  def entrypoint_argv(request):
      return shlex.split(request.config.getoption("--entrypoint"))
  ```
Check entrypoint:

```bash
# The entrypoint passed to tests
grep "entrypoint" evaluation/stdout.txt
```

Check if submission runs:

```bash
# Try running manually in workspace
cd outputs/checkpoint_1
python main.py --help
```

Get failure details:

```python
# Inspect failures on a results object (e.g. from run_checkpoint_pytest, shown below)
for test in results.tests:
    if test.status == "failed":
        print(f"{test.id}: {test.failure_message}")
```

Check CTRF report for details:

```python
import json

with open("evaluation/report.json") as f:
    ctrf = json.load(f)

for test in ctrf["results"]["tests"]:
    if test["status"] == "failed":
        print(test["name"], test.get("message", ""))
```

Symptoms:
- Tests killed after timeout
- `pytest-timeout` messages in output

Solutions:

- Increase checkpoint timeout

  ```yaml
  checkpoints:
    checkpoint_1:
      timeout: 120  # seconds
  ```

- Check for infinite loops in the submission

- Check for blocking I/O (see the per-test limit sketch below)
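If a single test is the likely culprit, `pytest-timeout` also supports per-test limits via a marker. A sketch; add `pytest-timeout` to `test_dependencies` first:

```python
import time

import pytest

@pytest.mark.timeout(30)  # fail this test if it exceeds 30 seconds
def test_slow_path():
    time.sleep(1)  # placeholder for the real work under test
```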
Symptoms:
- stderr shows pip/uv errors
- `infrastructure_failure: true`

Solutions:

- Check dependency format

  ```yaml
  test_dependencies:
    - "requests>=2.28"  # Version specifier
    - "pyyaml"          # Just package name
  ```

- Check for incompatible versions

  ```bash
  # Try installing manually
  uvx --with=pytest --with=my-package pytest --version
  ```

Check PyPI name:

```bash
# Verify the package exists (pip search is disabled; query the index instead)
pip index versions my-package
```

Symptoms:
- "PytestUnknownMarkWarning" in output
- Tests still run but with warnings
Solutions:
-
Register marker in problem config
markers: my_marker: description: "My custom marker" group: Functionality
-
Use built-in markers
@pytest.mark.error(GroupType.ERROR)@pytest.mark.functionality(GroupType.FUNCTIONALITY)@pytest.mark.regression(GroupType.REGRESSION)
Check marker precedence:
- Prior checkpoint tests → REGRESSION (regardless of markers)
@pytest.mark.error→ ERROR (current checkpoint only)@pytest.mark.regression→ REGRESSION- Custom markers from config
@pytest.mark.functionality→ FUNCTIONALITY- Default → CORE
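For reference, a test file combining a built-in marker with a registered custom marker might look like this. A sketch; `my_marker` must be registered in the problem config as shown above, and the test names are hypothetical:

```python
import pytest

@pytest.mark.error  # grouped as ERROR for the current checkpoint
def test_rejects_malformed_input():
    ...

@pytest.mark.my_marker  # grouped per its config registration (Functionality)
def test_custom_behavior():
    ...
```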
Check Docker status:

```bash
docker ps -a
docker logs <container_id>
```

Common causes:
- Port conflicts
- Resource limits
- Image not found
Check connectivity:

```bash
# Inside container
curl -v http://host.docker.internal:8080
```

Check permissions:

```bash
ls -la /workspace  # Inside container
```

Inspect the evaluation logs:

```bash
# View stdout (test output)
cat outputs/checkpoint_1/evaluation/stdout.txt

# View stderr (errors and warnings)
cat outputs/checkpoint_1/evaluation/stderr.txt
```

Parse the CTRF report:

```python
import json

with open("outputs/checkpoint_1/evaluation/report.json") as f:
    report = json.load(f)

# Summary
print(f"Passed: {report['results']['summary']['passed']}")
print(f"Failed: {report['results']['summary']['failed']}")

# Individual tests
for test in report["results"]["tests"]:
    print(f"{test['name']}: {test['status']}")
```

Run pytest manually:

```bash
# Navigate to workspace with tests
cd outputs/run_123/checkpoint_1

# Run pytest directly (similar to what PytestRunner does)
uvx \
  --with=pytest \
  --with=pytest-json-ctrf \
  --with=pytest-json-report \
  pytest \
  --entrypoint='python main.py' \
  --checkpoint='checkpoint_1' \
  -vv \
  .evaluation_tests/
```

Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)

from slop_code.evaluation import run_checkpoint_pytest
results = run_checkpoint_pytest(...)
```

Verify tests were copied:

```bash
ls -la outputs/checkpoint_1/.evaluation_tests/

# Should contain:
# - conftest.py
# - test_checkpoint_1.py
# - (possibly test_checkpoint_0.py if include_prior_tests=true)
```

Fix: Add `__init__.py` or check test directory structure
```bash
touch tests/__init__.py  # Sometimes needed
```

Fix: Define fixture in conftest.py

```python
import shlex

import pytest

@pytest.fixture(scope="session")
def entrypoint_argv(request):
    return shlex.split(request.config.getoption("--entrypoint"))
```

Fix: Add pytest_addoption in conftest.py

```python
def pytest_addoption(parser):
    parser.addoption("--entrypoint", action="store", required=True)
    parser.addoption("--checkpoint", action="store", required=True)
```

Fix: Check that test function names start with `test_`

```python
# Wrong: pytest will not collect this
def check_something():
    ...

# Correct
def test_something():
    ...
```

Fix: Check entrypoint configuration

```yaml
# Problem config
entry_file: main.py

# Environment config
commands:
  command: python
  entry_file: "{entry_file}"
```
If evaluation runs are slow or resource-hungry:

- Reduce test count for development

  ```bash
  pytest -k "test_basic" .evaluation_tests/
  ```

- Increase parallelization (if tests are independent)

  ```yaml
  test_dependencies:
    - "pytest-xdist"
  # Then use: pytest -n auto
  ```

- Check for slow submission startup

- Profile tests

  ```bash
  pytest --memprof .evaluation_tests/  # requires a memory-profiling plugin (e.g. pytest-memprof)
  ```

- Reduce test data size

- Use fixtures with proper scope

  ```python
  @pytest.fixture(scope="session")  # Not "function"
  def expensive_data():
      return load_data()
  ```
When reporting issues, include:
- Exit code and infrastructure_failure status
- Pytest stdout and stderr (from evaluation/ directory)
- Problem config (especially test_dependencies, markers)
- Checkpoint config (timeout, include_prior_tests)
- Test file structure (list of files in tests/)
- Environment (Python version, Docker version if applicable)
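A throwaway helper can gather these artifacts into one place before filing the issue. A sketch; the paths follow the layout used throughout this guide, and the helper name is ours:

```python
import shutil
from pathlib import Path

def bundle_diagnostics(workspace: str, dest: str = "diagnostics") -> None:
    """Copy the evaluation artifacts listed above into a single folder."""
    out = Path(dest)
    out.mkdir(exist_ok=True)
    for name in ("stdout.txt", "stderr.txt", "report.json"):
        src = Path(workspace) / "evaluation" / name
        if src.exists():
            shutil.copy(src, out / name)

bundle_diagnostics("outputs/checkpoint_1")
```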
Issue: Tests not collected

```
Exit code: 5
infrastructure_failure: true
pytest_collected: 0
```

Directory structure:

```
problems/my_problem/
├── config.yaml
└── tests/
    ├── conftest.py
    └── test_checkpoint_1.py
```

stderr:

```
ModuleNotFoundError: No module named 'custom_utils'
```

config.yaml:

```yaml
test_dependencies: []  # Missing custom_utils
```
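The fix for this example is to declare the missing dependency; assuming the PyPI package name matches the imported module:

```yaml
test_dependencies:
  - "custom_utils"  # hypothetical package providing the missing module
```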
See also:
- Understand architecture: Architecture Guide
- Check configuration: Configuration Guide
- Interpret results: Reporting Guide