Text-to-SQL Datasus Agent Evaluation

Evaluation of Text-to-SQL agent performance on Brazilian healthcare data (DATASUS/SIH-RS) using LLaMA 3.1:8b.

Overview

Dataset: 59 ground truth queries across three difficulty levels

EASY (25): Single table, basic filters, COUNT operations
MEDIUM (19): Multi-table JOINs, GROUP BY, aggregations
HARD (15): Complex JOINs, subqueries, mortality calculations

Model: Ollama LLaMA 3.1:8b Evaluation Date: November 3, 2025 Total Execution Time: 531.1s (9.0s avg per query)

Evaluation Metrics

Three metrics following Spider benchmark methodology:

Execution Accuracy (EX) - Primary metric

Percentage of queries returning correct results
Threshold: Score ≥ 0.8 considered correct
Most critical indicator of agent performance

Component Matching (CM)

Structural similarity between generated and ground truth SQL
Components: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN, nested queries
Measures semantic understanding despite different formulation

Exact Match (EM)

Binary match between generated and ground truth SQL
Least critical - multiple valid SQL queries can produce same results

Results

Summary

Metric	Score
Agent Success Rate	100.0%
Execution Accuracy (EX)	81.6%
Component Matching (CM)	61.6%
Exact Match (EM)	16.9%

High EX (81.6%) with low EM (16.9%) indicates the agent generates semantically correct but syntactically different SQL queries.

Performance Analysis

By Difficulty Level

Difficulty	Questions	Success Rate	EX	CM	EM
EASY	25	100.0% (25/25)	96.0%	75.0%	16.0%
MEDIUM	19	100.0% (19/19)	78.9%	61.1%	15.8%
HARD	15	100.0% (15/15)	60.0%	43.3%	6.7%

Performance by Query Complexity

EASY Queries

Excellent execution accuracy (96.0%)
Strong component matching (75.0%)
Covers: Single table operations, basic filtering, COUNT aggregations, MIN/MAX operations

MEDIUM Queries

Good execution accuracy (78.9%)
Moderate component matching (61.1%)
Covers: Multi-table JOINs, GROUP BY operations, temporal filters

HARD Queries

Moderate execution accuracy (60.0%)
Lower component matching (43.3%)
Covers: Mortality rate calculations, nested subqueries, complex temporal analysis

Visualizations

Metrics Comparison

Shows EX (81.6%) significantly exceeds EM (16.9%), indicating semantically correct but syntactically different SQL generation.

Performance by Difficulty

Demonstrates performance degradation with increasing complexity. Gap between EX and EM widens for HARD queries.

Success Rate Distribution

Overall success rate: 100.0% (59/59 queries). Distribution shows strong performance across all difficulty levels.

Analysis

Low EM with High EX

The disparity between EM (16.9%) and EX (81.6%) is expected and indicates SQL flexibility. Multiple syntactically different queries can produce identical results.

Example:

-- Ground Truth (EM = 0)
SELECT COUNT(*) FROM internacoes i
LEFT JOIN mortes m ON i."N_AIH" = m."N_AIH"
WHERE m."N_AIH" IS NULL

-- Agent Generated (EX = 1, EM = 0, CM = 0.7)
SELECT COUNT(*) FROM internacoes
WHERE "N_AIH" NOT IN (SELECT "N_AIH" FROM mortes)

Both queries return identical results despite different formulations.

HARD Query Error Patterns

Analysis reveals systematic challenges:

Mortality rate calculations requiring percentage operations
Multi-table temporal JOINs with complex date filtering
Nested aggregations within subqueries
CID-10 disease code pattern matching

Running Evaluation

# Run full evaluation
python evaluation/run_dag_evaluation.py

# Run by difficulty level
python evaluation/run_dag_evaluation.py --difficulty EASY

# Generate visualizations and report
python evaluation/generate_report.py

Citation

Please refer to this work by citing the paper indicated below.

M. K. Moraes, I. Figueiredo, V. Marques, J. Santos, I. H. Manssour. From Questions to Answers: A Natural Language Interface for DATASUS Hospitalization Data. I Escola Regional de Aprendizado de Máquina e Inteligência Artificial da Região Sul (ERAMIA-RS), 2025, Porto Alegre/RS, Brasil. SBC, Porto Alegre, Brasil.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.vscode		.vscode
agent		agent
database		database
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text-to-SQL Datasus Agent Evaluation

Overview

Evaluation Metrics