Evaluation of Text-to-SQL agent performance on Brazilian healthcare data (DATASUS/SIH-RS) using LLaMA 3.1:8b.
Dataset: 59 ground truth queries across three difficulty levels
- EASY (25): Single table, basic filters, COUNT operations
- MEDIUM (19): Multi-table JOINs, GROUP BY, aggregations
- HARD (15): Complex JOINs, subqueries, mortality calculations
Model: Ollama LLaMA 3.1:8b Evaluation Date: November 3, 2025 Total Execution Time: 531.1s (9.0s avg per query)
Three metrics following Spider benchmark methodology:
Execution Accuracy (EX) - Primary metric
- Percentage of queries returning correct results
- Threshold: Score ≥ 0.8 considered correct
- Most critical indicator of agent performance
Component Matching (CM)
- Structural similarity between generated and ground truth SQL
- Components: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN, nested queries
- Measures semantic understanding despite different formulation
Exact Match (EM)
- Binary match between generated and ground truth SQL
- Least critical - multiple valid SQL queries can produce same results
| Metric | Score |
|---|---|
| Agent Success Rate | 100.0% |
| Execution Accuracy (EX) | 81.6% |
| Component Matching (CM) | 61.6% |
| Exact Match (EM) | 16.9% |
High EX (81.6%) with low EM (16.9%) indicates the agent generates semantically correct but syntactically different SQL queries.
| Difficulty | Questions | Success Rate | EX | CM | EM |
|---|---|---|---|---|---|
| EASY | 25 | 100.0% (25/25) | 96.0% | 75.0% | 16.0% |
| MEDIUM | 19 | 100.0% (19/19) | 78.9% | 61.1% | 15.8% |
| HARD | 15 | 100.0% (15/15) | 60.0% | 43.3% | 6.7% |
EASY Queries
- Excellent execution accuracy (96.0%)
- Strong component matching (75.0%)
- Covers: Single table operations, basic filtering, COUNT aggregations, MIN/MAX operations
MEDIUM Queries
- Good execution accuracy (78.9%)
- Moderate component matching (61.1%)
- Covers: Multi-table JOINs, GROUP BY operations, temporal filters
HARD Queries
- Moderate execution accuracy (60.0%)
- Lower component matching (43.3%)
- Covers: Mortality rate calculations, nested subqueries, complex temporal analysis
Shows EX (81.6%) significantly exceeds EM (16.9%), indicating semantically correct but syntactically different SQL generation.
Demonstrates performance degradation with increasing complexity. Gap between EX and EM widens for HARD queries.
Overall success rate: 100.0% (59/59 queries). Distribution shows strong performance across all difficulty levels.
Low EM with High EX
The disparity between EM (16.9%) and EX (81.6%) is expected and indicates SQL flexibility. Multiple syntactically different queries can produce identical results.
Example:
-- Ground Truth (EM = 0)
SELECT COUNT(*) FROM internacoes i
LEFT JOIN mortes m ON i."N_AIH" = m."N_AIH"
WHERE m."N_AIH" IS NULL
-- Agent Generated (EX = 1, EM = 0, CM = 0.7)
SELECT COUNT(*) FROM internacoes
WHERE "N_AIH" NOT IN (SELECT "N_AIH" FROM mortes)Both queries return identical results despite different formulations.
HARD Query Error Patterns
Analysis reveals systematic challenges:
- Mortality rate calculations requiring percentage operations
- Multi-table temporal JOINs with complex date filtering
- Nested aggregations within subqueries
- CID-10 disease code pattern matching
# Run full evaluation
python evaluation/run_dag_evaluation.py
# Run by difficulty level
python evaluation/run_dag_evaluation.py --difficulty EASY
# Generate visualizations and report
python evaluation/generate_report.pyPlease refer to this work by citing the paper indicated below.
M. K. Moraes, I. Figueiredo, V. Marques, J. Santos, I. H. Manssour. From Questions to Answers: A Natural Language Interface for DATASUS Hospitalization Data. I Escola Regional de Aprendizado de Máquina e Inteligência Artificial da Região Sul (ERAMIA-RS), 2025, Porto Alegre/RS, Brasil. SBC, Porto Alegre, Brasil.


