A focused, lightweight retriever evaluation framework with standard Information Retrieval metrics
Evret brings standard Information Retrieval metrics to your recommendation, RAG, and search systems. Evaluate retrievers with Hit Rate, Recall, Precision, MRR, NDCG, and Average Precision in just a few lines of code. Built for simplicity, extensibility, and seamless integration with vector databases and AI frameworks.
Evret is a modern Python framework designed for evaluating retrieval systems in Information Retrieval pipelines and search applications. It provides:
- Standard IR Metrics: Hit Rate, Recall, Precision, MRR, NDCG, and Average Precision
- Judge-Based Matching: Token overlap, semantic, and LLM judges for text relevance
- Vector Database Support: Native adapters for Qdrant and other vector databases
- Framework Integration: Adapters for LangChain and LlamaIndex
```bash
pip install evret
```

For optional integrations:

```bash
pip install "evret[all]"

# Install specific integrations
pip install "evret[qdrant]"
pip install "evret[langchain]"
pip install "evret[semantic]"
```

Evaluate a retriever in a few lines:

```python
from evret import EvaluationDataset, Evaluator, HitRate, MRR, NDCG, TokenOverlapJudge
dataset = EvaluationDataset.from_json("eval_data.json")
evaluator = Evaluator(
retriever=my_retriever,
metrics=[HitRate(k=4), MRR(k=4), NDCG(k=4)],
judge=TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6),
)
results = evaluator.evaluate(dataset)
print(results.summary())
# Optional: Export results
results.to_json("results.json")
results.to_csv("results.csv")
```

Use uv for local development:

```bash
uv venv
source .venv/bin/activate
uv pip install -e .
```

Install all optional integrations:

```bash
uv pip install -e ".[all]"
```

Run the default test suite:

```bash
pytest
```

See tests/README.md for coverage areas, optional integration tests, and test setup notes.
Evret supports the standard Information Retrieval metrics:
| Metric | Description | Use Case |
|---|---|---|
| Hit Rate@k | % of queries with at least one relevant doc in top-k | Binary relevance, recall-focused |
| Recall@k | % of relevant docs found in top-k | Comprehensive retrieval |
| Precision@k | % of top-k results that are relevant | Precision-focused systems |
| MRR@k | Mean Reciprocal Rank of first relevant doc | Single-answer retrieval |
| NDCG@k | Normalized Discounted Cumulative Gain | Rank-aware binary relevance quality |
| Average Precision@k | Area under precision-recall curve | Overall ranking quality |
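To make the rank-sensitive metrics concrete, here is a small, self-contained sketch (plain Python, not Evret's internals) that computes Hit Rate, reciprocal rank, and NDCG for one toy ranked result list:

```python
import math

# Toy top-4 result list: 1 = relevant, 0 = not relevant
relevance = [0, 1, 0, 1]

# Hit Rate@k: did at least one relevant doc make it into the top-k?
hit_rate = 1.0 if any(relevance) else 0.0

# Reciprocal rank: 1 / rank of the first relevant doc (0 if none is found)
reciprocal_rank = next((1.0 / (i + 1) for i, rel in enumerate(relevance) if rel), 0.0)

# NDCG@k: discounted gain of the actual ranking divided by the ideal ranking
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))
ideal_dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(sorted(relevance, reverse=True)))
ndcg = dcg / ideal_dcg if ideal_dcg else 0.0

print(hit_rate, reciprocal_rank, round(ndcg, 3))  # 1.0 0.5 0.651
```

In Evret, these per-query scores are aggregated over the whole dataset by the Evaluator, with each metric truncating results at its k cutoff.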
Create evaluation datasets with queries and expected answers for judge-based evaluation:

```python
from evret import EvaluationDataset, QueryExample, DocumentExample
dataset = EvaluationDataset(
documents=[
DocumentExample(doc_id="doc_1", text="Python uses pip for packages."),
DocumentExample(doc_id="doc_2", text="Virtual environments isolate dependencies."),
],
queries=[
QueryExample(
query_id="q1",
query_text="How to install Python packages?",
expected_answers=["pip install"] # Judge matches this against retrieved text
)
]
)
```

Load datasets from JSON or CSV files:

```python
dataset = EvaluationDataset.from_json("eval_data.json")
dataset = EvaluationDataset.from_csv("eval_data.csv")
```

For detailed dataset format documentation, classic IR evaluation with document IDs, and more examples, see the Dataset Format Guide.
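For reference, the on-disk file can mirror the in-memory structure above. The snippet below writes a hypothetical eval_data.json with the same fields as DocumentExample and QueryExample; the exact schema is an assumption here, so treat the Dataset Format Guide as authoritative:

```python
import json

# Hypothetical eval_data.json mirroring the DocumentExample/QueryExample fields above
eval_data = {
    "documents": [
        {"doc_id": "doc_1", "text": "Python uses pip for packages."},
        {"doc_id": "doc_2", "text": "Virtual environments isolate dependencies."},
    ],
    "queries": [
        {
            "query_id": "q1",
            "query_text": "How to install Python packages?",
            "expected_answers": ["pip install"],
        }
    ],
}

with open("eval_data.json", "w", encoding="utf-8") as f:
    json.dump(eval_data, f, indent=2)
```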
Judges decide whether retrieved text matches the expected answers in your evaluation dataset. Evaluator uses TokenOverlapJudge() by default; pass judge= when you want to control the matching behavior explicitly.

```python
from evret import Evaluator, HitRate, Recall
from evret.judges import TokenOverlapJudge
evaluator = Evaluator(
retriever=my_retriever,
metrics=[HitRate(k=4), Recall(k=4)],
judge=TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6),
)
```
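To build intuition for the TokenOverlapJudge parameters, here is a conceptual sketch of token-overlap matching. It is not Evret's implementation, and the exact semantics of min_tokens and overlap_ratio are assumptions; it only illustrates the idea of requiring a minimum fraction of expected-answer tokens to appear in the retrieved text:

```python
def token_overlap_match(expected: str, retrieved: str,
                        min_tokens: int = 2, overlap_ratio: float = 0.6) -> bool:
    """Illustrative only: does enough of the expected answer appear in the retrieved text?"""
    expected_tokens = set(expected.lower().split())
    retrieved_tokens = set(retrieved.lower().split())
    if len(expected_tokens) < min_tokens:
        return False  # assumed meaning: too few tokens to judge reliably
    overlap = expected_tokens & retrieved_tokens
    return len(overlap) / len(expected_tokens) >= overlap_ratio

print(token_overlap_match("pip install", "Install packages with pip install."))  # True
```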
Use SemanticJudge for embedding similarity and LLMJudge for LLM-provider-based judgment:

```python
from evret.judges import LLMJudge, SemanticJudge

semantic_judge = SemanticJudge(threshold=0.75)
llm_judge = LLMJudge(provider="openai", model="gpt-4o-mini")
```

Evret also ships a native retriever for Qdrant:

```python
from evret.retrievers import QdrantRetriever
retriever = QdrantRetriever(
collection_name="docs",
query_encoder=embed_query,
url="http://localhost:6333",
id_field="doc_id",
)
```
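The QdrantRetriever example assumes embed_query, a user-supplied callable that encodes a query string into a vector compatible with the embeddings stored in the collection. A minimal sketch using sentence-transformers (the model name and the list-of-floats return type are assumptions, not requirements):

```python
from sentence_transformers import SentenceTransformer

# Must produce vectors compatible with those already stored in the "docs" collection.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_query(query: str) -> list[float]:
    """Encode a query into a dense vector for Qdrant similarity search."""
    return _model.encode(query).tolist()
```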
To use an Evret retriever inside LangChain, wrap it with the adapter:

```python
from evret.integrations import LangChainRetrieverAdapter

# Wrap any Evret retriever for use in LangChain
lc_retriever = LangChainRetrieverAdapter(evret_retriever=retriever, k=5)
docs = lc_retriever.invoke("what is information retrieval?")
```

Run the default suite:

```bash
pytest
```

Run Docker-backed integration tests:

```bash
EVRET_RUN_INTEGRATION=1 pytest -m integration
```

More details are in tests/README.md.
The Evret docs use MkDocs with the Material theme.
Install docs dependencies:
uv pip install -e ".[docs]"Run docs locally:
mkdocs serve
```

Build static docs:

```bash
mkdocs build
```

MIT License - see LICENSE for details.
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
If you use Evret in your research, please cite:
```bibtex
@software{evret2026,
  title={Evret: A Focused Retriever Evaluation Framework},
  author={lucifertrj},
  year={2026},
}
```