A unified framework for LLM quality assessment and synthetic data generation
DiTing is a powerful AI evaluation and data synthesis platform designed specifically for Large Language Model (LLM) scenarios. It provides comprehensive evaluation metrics and high-quality synthetic data generation capabilities to help data scientists, AI engineers, and system developers better assess and improve AI system performance.
- Multi-dimensional Evaluation: Supports answer correctness, relevancy, similarity, context precision, recall, and more
- Modular Architecture: Separation of core engine (diting-core) and API server (diting-server)
- High-quality Data Synthesis: Generate high-quality evaluation datasets from raw corpora
- Flexible Extension: Support for custom evaluation metrics and synthesis methods
- High-performance API: High-performance web service built on FastAPI
- Complete Type Support: Data validation and type checking using Pydantic
```
diting/
├── packages/
│   ├── diting-core/              # Core evaluation & synthesis engine
│   │   ├── src/diting_core/
│   │   │   ├── callbacks/        # Callback system
│   │   │   ├── cases/            # Test case definitions
│   │   │   ├── metrics/          # Evaluation metrics library
│   │   │   ├── models/           # LLM and Embedding models
│   │   │   ├── synthesis/        # Data synthesis modules
│   │   │   └── utilities/        # Utility functions
│   │   └── pyproject.toml
│   └── diting-server/            # Web API server
│       ├── src/diting_server/
│       │   ├── apis/v1/          # API route definitions
│       │   ├── common/           # Common components
│       │   ├── config/           # Configuration management
│       │   ├── exceptions/       # Exception handling
│       │   ├── services/         # Business logic services
│       │   └── main.py           # Application entry point
│       └── pyproject.toml
├── tests/                        # Test suites
├── Dockerfile                    # Docker build file
├── Makefile                      # Build scripts
└── pyproject.toml                # Project configuration
```
diting-core is the core engine of the DiTing platform, providing a complete evaluation metrics library and data synthesis functionality. It is an independent Python package that can be directly integrated into other projects as a library.
Core Architecture:
- `BaseMetric`: Abstract base class for all evaluation metrics
- `MetricValue`: Standardized evaluation result format
- Callback system: Support for evaluation process monitoring and logging
Built-in Evaluation Metrics:
- Answer Correctness
  - Evaluates factual accuracy and semantic similarity between actual and expected outputs
  - Based on LLM judgment, combining F1 score and similarity calculations
  - Range: [0, 1], higher is better
- Answer Similarity
  - Calculates cosine similarity between actual and expected outputs using embeddings
  - Suitable for evaluating semantically equivalent but lexically different answers
  - Range: [0, 1], higher is better
- Answer Relevancy
  - Evaluates how relevant the generated answer is to the user input question
  - Based on question generation and similarity calculation
  - Range: [0, 1], higher is better
- Faithfulness
  - Measures factual consistency between answers and retrieved context
  - Specifically designed for RAG systems
  - Range: [0, 1], higher indicates more faithful to context
- Context Recall
  - Evaluates completeness of relevant information in retrieved context
  - Calculates proportion of reference answer claims supported by retrieved context
  - Range: [0, 1], higher indicates more complete retrieval
- Context Precision
  - Evaluates proportion of relevant information in retrieved context
  - Uses Precision@k to calculate average precision
  - Range: [0, 1], higher indicates better retrieval quality
- QA Quality
  - Comprehensive evaluation of question-answer pair quality
  - Combines multiple dimensions for overall scoring
- RAG Runtime
  - Runtime evaluation specifically for RAG systems
  - Considers both retrieval and generation quality
- Custom Metric
  - Supports user-defined evaluation logic
  - Define evaluation criteria through prompts
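To make the [0, 1] score ranges concrete, here is a minimal pure-Python sketch of the kind of math behind similarity- and correctness-style metrics: cosine similarity between embedding vectors, and an F1 score over extracted claim sets. The function names and the equal-weight combination are illustrative assumptions, not the actual diting-core implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (near [0, 1]
    for typical text embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def claim_f1(actual_claims: set[str], expected_claims: set[str]) -> float:
    """F1 over claim sets: precision/recall of claims in the actual
    answer against claims in the expected answer."""
    if not actual_claims or not expected_claims:
        return 0.0
    tp = len(actual_claims & expected_claims)
    precision = tp / len(actual_claims)
    recall = tp / len(expected_claims)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def correctness_score(f1: float, similarity: float) -> float:
    """Hypothetical equal-weight combination into a [0, 1] score."""
    return 0.5 * f1 + 0.5 * similarity
```

In practice the claim extraction itself is done by an LLM, which is why these metrics are described as LLM-judged rather than purely statistical.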
Core Architecture:
- `BaseSynthesizer`: Abstract base class for data synthesizers
- `BaseCorpus`: Corpus management and processing
- Quality control: Built-in quality assessment and filtering mechanisms
QA Synthesizer:
- Generate high-quality question-answer pairs from given context and themes
- Support quality assessment and automatic retry mechanisms
- Configurable generation quantity and concurrency control
- Built-in quality threshold filtering
- Support batch generation and quality ranking
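The retry-and-filter behaviour described above can be sketched as follows. `generate` and `judge_quality` are stand-in callables, not actual diting-core APIs: the loop regenerates a QA pair until one clears the quality threshold or the retry budget runs out.

```python
from typing import Callable, Optional

QAPair = dict[str, str]  # e.g. {"question": ..., "answer": ...}


def synthesize_with_retries(
    generate: Callable[[], QAPair],
    judge_quality: Callable[[QAPair], float],  # returns a score in [0, 1]
    threshold: float = 0.7,
    max_retries: int = 3,
) -> Optional[QAPair]:
    """Regenerate until a QA pair clears the quality threshold,
    giving up after max_retries attempts."""
    for _ in range(max_retries):
        pair = generate()
        if judge_quality(pair) >= threshold:
            return pair
    return None  # nothing met the threshold; caller decides how to handle
```

Batch generation with concurrency control would typically wrap calls like this in an async task pool with a semaphore; the threshold and retry count here correspond to the configurable quality controls listed above.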
LLM Model Support:
- OpenAI GPT series
- Other OpenAI API-compatible models
- Extensible model factory pattern
Embedding Model Support:
- OpenAI Embeddings
- Custom embedding model integration
- Optimized vector similarity computation
- `LLMCase`: Standardized test case format
- Support for multiple input/output types
- Metadata management and extension
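As an illustration of what such a test case might carry, here is a hypothetical sketch using stdlib dataclasses; the real package models its types with Pydantic, and the field names here are assumptions, not the actual `LLMCase` schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LLMCase:
    """Illustrative test-case container (field names are assumptions)."""
    user_input: str                        # the question posed to the system
    actual_output: str                     # what the system answered
    expected_output: Optional[str] = None  # reference answer, if any
    retrieval_context: list[str] = field(default_factory=list)  # for RAG metrics
    metadata: dict[str, Any] = field(default_factory=dict)      # free-form extension


case = LLMCase(
    user_input="What is artificial intelligence?",
    actual_output="AI is a branch of computer science",
    expected_output="Artificial intelligence is technology that simulates human intelligence",
)
```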
diting-server is built on FastAPI, providing high-performance Web API interfaces for diting-core, supporting distributed deployment and production environment usage.
Service Features:
- High-performance asynchronous API based on FastAPI
- Complete type validation and automatic documentation generation
- Structured logging and monitoring support
- CORS support and security configuration
- Graceful error handling and exception management
Evaluation API:
- `POST /api/v1/evaluations/runs` - Execute evaluation tasks
- Support multiple LLM and Embedding model configurations
- Real-time return of evaluation results and token usage
- Support custom evaluation metrics
- Provide detailed execution logs
Data Synthesis API:
- `POST /api/v1/dataset-synthesis/runs` - Execute data synthesis tasks
- Support batch generation and metadata management
- Provide detailed generation logs and usage statistics
- Support quality control and filtering
- Chunk processing and progress tracking
Health Check API:
- `GET /healthz` - Service health status check
- Service availability monitoring
- Dependency service status check
Configuration Management:
- Environment variable configuration
- Dynamic configuration loading
- Model configuration management
Logging System:
- Structured log output
- Request tracing and performance monitoring
- Error reporting and debugging information
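A structured (JSON) log line of the kind described can be produced with the stdlib alone; this is a generic sketch of the `LOG_FORMAT=json` style, not the server's actual logging configuration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("diting")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("evaluation started")  # emits {"level": "INFO", ...}
```

Request IDs and timings for tracing would be added as extra fields on the record in the same way.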
Exception Handling:
- Unified exception handling mechanism
- User-friendly error response format
- Detailed error information and suggestions
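One common shape for a "user-friendly error response format" is an envelope with a machine-readable code, a message, and an optional suggestion. The exact fields diting-server uses are not specified here, so this builder is purely illustrative.

```python
from typing import Any, Optional


def error_response(
    code: str,
    message: str,
    suggestion: Optional[str] = None,
) -> dict[str, Any]:
    """Build a uniform error envelope (illustrative field names)."""
    body: dict[str, Any] = {"error": {"code": code, "message": message}}
    if suggestion:
        body["error"]["suggestion"] = suggestion
    return body
```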
Evaluation Service:
- Evaluation task orchestration and execution
- Model configuration parsing and validation
- Token usage statistics and billing
Synthesis Service:
- Data synthesis task management
- Quality control and filtering
- Batch processing and concurrency control
- Backend Framework: Python 3.11+ with FastAPI
- Dependency Management: uv
- Data Validation: Pydantic
- Code Quality: Ruff (linting & formatting) + MyPy/Pyright (type checking)
- Testing Framework: Pytest + Coverage
- Containerization: Docker
- Development Tools: Pre-commit hooks
- Python 3.11+
- uv package manager
- Make build tool
- Docker (optional, for containerized deployment)
- Clone the project

  ```shell
  git clone <repository-url>
  cd diting
  ```

- Install dependencies

  ```shell
  make install
  ```

  This will automatically install all dependencies and configure pre-commit hooks.

- Sync dependencies (optional)

  ```shell
  make sync
  ```
```shell
# Run all tests
make test

# Generate coverage report
make testcov
```

```shell
# Format code
make format

# Lint code
make lint

# Type check
make typecheck

# Run all quality checks
make all
```

```shell
# Method 1: Direct run
cd packages/diting-server
uvicorn diting_server.main:app --reload --host 0.0.0.0 --port 3000

# Method 2: Use Python module
python -m diting_server.main
```

The server will start at http://localhost:3000, and API documentation can be accessed at http://localhost:3000/docs.
- Build image

  ```shell
  docker build -t diting .
  ```

- Run container

  ```shell
  docker run -p 3000:3000 diting
  ```

Docker Compose deployment:

```yaml
version: '3.8'
services:
  diting:
    image: diting:latest
    ports:
      - "3000:3000"
    environment:
      - LOG_LEVEL=INFO
      - LOG_FORMAT=json
    restart: unless-stopped
```

Production deployment:

```shell
# Install production dependencies
uv sync --no-dev

# Start service
uvicorn diting_server.main:app --host 0.0.0.0 --port 3000 --workers 4
```

Evaluation example:

```python
import requests

# Evaluation request
request_data = {
    "llmConfig": {
        "name": "gpt-3.5-turbo",
        "baseUrl": "https://api.openai.com/v1",
        "apiKey": "your-api-key"
    },
    "metricConfig": {
        "metricName": "answer_correctness",
        "metricType": "builtin_metric"
    },
    "evalCase": {
        "userInput": "What is artificial intelligence?",
        "actualOutput": "AI is a branch of computer science",
        "expectedOutput": "Artificial intelligence is technology that simulates human intelligence"
    }
}

response = requests.post(
    "http://localhost:3000/api/v1/evaluations/runs",
    json=request_data
)
result = response.json()
print(f"Evaluation score: {result['data']['score']}")
print(f"Evaluation reason: {result['data']['reason']}")
```

Data synthesis example:

```python
import requests

# Data synthesis request
synthesis_request = {
    "llmConfig": {
        "name": "gpt-3.5-turbo",
        "baseUrl": "https://api.openai.com/v1",
        "apiKey": "your-api-key"
    },
    "synthesizerConfig": {
        "synthesizerName": "q_a_synthesizer"
    },
    "inputData": {
        "context": ["Artificial intelligence is an important branch of computer science..."],
        "themes": ["AI basic concepts"]
    }
}

response = requests.post(
    "http://localhost:3000/api/v1/dataset-synthesis/runs",
    json=synthesis_request
)
result = response.json()
print(f"Generated question: {result['data']['qaPair']['question']}")
print(f"Generated answer: {result['data']['qaPair']['answer']}")
```

Environment variables:

- `LOG_LEVEL`: Log level (DEBUG, INFO, WARNING, ERROR)
- `LOG_FORMAT`: Log format (text, json)
- `HOST`: Service bind address (default: 0.0.0.0)
- `PORT`: Service port (default: 3000)
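For example, these variables can be exported in the shell before launching the server (the values below are arbitrary illustrations, not recommended settings):

```shell
export LOG_LEVEL=DEBUG
export LOG_FORMAT=json
export HOST=0.0.0.0
export PORT=3000
# then start the service, e.g.:
# uvicorn diting_server.main:app --host "$HOST" --port "$PORT"
```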
Support for multiple LLM and Embedding models:
- OpenAI GPT series
- Other OpenAI API-compatible models
- Custom model integration
- Use Ruff for code formatting
- Follow type annotation standards
- Write comprehensive unit tests
- Run `make all` to check code quality before committing
DiTing - Making AI evaluation and data synthesis simpler and more reliable.