Skip to content

DAVINTLAB/DATASUSAnalytics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Text-to-SQL Datasus Agent Evaluation

Evaluation of Text-to-SQL agent performance on Brazilian healthcare data (DATASUS/SIH-RS) using LLaMA 3.1:8b.

Overview

Dataset: 59 ground truth queries across three difficulty levels

  • EASY (25): Single table, basic filters, COUNT operations
  • MEDIUM (19): Multi-table JOINs, GROUP BY, aggregations
  • HARD (15): Complex JOINs, subqueries, mortality calculations

Model: Ollama LLaMA 3.1:8b Evaluation Date: November 3, 2025 Total Execution Time: 531.1s (9.0s avg per query)

Evaluation Metrics

Three metrics following Spider benchmark methodology:

Execution Accuracy (EX) - Primary metric

  • Percentage of queries returning correct results
  • Threshold: Score ≥ 0.8 considered correct
  • Most critical indicator of agent performance

Component Matching (CM)

  • Structural similarity between generated and ground truth SQL
  • Components: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN, nested queries
  • Measures semantic understanding despite different formulation

Exact Match (EM)

  • Binary match between generated and ground truth SQL
  • Least critical - multiple valid SQL queries can produce same results

Results

Summary

Metric Score
Agent Success Rate 100.0%
Execution Accuracy (EX) 81.6%
Component Matching (CM) 61.6%
Exact Match (EM) 16.9%

High EX (81.6%) with low EM (16.9%) indicates the agent generates semantically correct but syntactically different SQL queries.


Performance Analysis

By Difficulty Level

Difficulty Questions Success Rate EX CM EM
EASY 25 100.0% (25/25) 96.0% 75.0% 16.0%
MEDIUM 19 100.0% (19/19) 78.9% 61.1% 15.8%
HARD 15 100.0% (15/15) 60.0% 43.3% 6.7%

Performance by Query Complexity

EASY Queries

  • Excellent execution accuracy (96.0%)
  • Strong component matching (75.0%)
  • Covers: Single table operations, basic filtering, COUNT aggregations, MIN/MAX operations

MEDIUM Queries

  • Good execution accuracy (78.9%)
  • Moderate component matching (61.1%)
  • Covers: Multi-table JOINs, GROUP BY operations, temporal filters

HARD Queries

  • Moderate execution accuracy (60.0%)
  • Lower component matching (43.3%)
  • Covers: Mortality rate calculations, nested subqueries, complex temporal analysis

Visualizations

Metrics Comparison

Metrics Comparison

Shows EX (81.6%) significantly exceeds EM (16.9%), indicating semantically correct but syntactically different SQL generation.

Performance by Difficulty

Difficulty Breakdown

Demonstrates performance degradation with increasing complexity. Gap between EX and EM widens for HARD queries.

Success Rate Distribution

Success Rate

Overall success rate: 100.0% (59/59 queries). Distribution shows strong performance across all difficulty levels.

Analysis

Low EM with High EX

The disparity between EM (16.9%) and EX (81.6%) is expected and indicates SQL flexibility. Multiple syntactically different queries can produce identical results.

Example:

-- Ground Truth (EM = 0)
SELECT COUNT(*) FROM internacoes i
LEFT JOIN mortes m ON i."N_AIH" = m."N_AIH"
WHERE m."N_AIH" IS NULL

-- Agent Generated (EX = 1, EM = 0, CM = 0.7)
SELECT COUNT(*) FROM internacoes
WHERE "N_AIH" NOT IN (SELECT "N_AIH" FROM mortes)

Both queries return identical results despite different formulations.

HARD Query Error Patterns

Analysis reveals systematic challenges:

  • Mortality rate calculations requiring percentage operations
  • Multi-table temporal JOINs with complex date filtering
  • Nested aggregations within subqueries
  • CID-10 disease code pattern matching

Running Evaluation

# Run full evaluation
python evaluation/run_dag_evaluation.py

# Run by difficulty level
python evaluation/run_dag_evaluation.py --difficulty EASY

# Generate visualizations and report
python evaluation/generate_report.py

Citation

Please refer to this work by citing the paper indicated below.

M. K. Moraes, I. Figueiredo, V. Marques, J. Santos, I. H. Manssour. From Questions to Answers: A Natural Language Interface for DATASUS Hospitalization Data. I Escola Regional de Aprendizado de Máquina e Inteligência Artificial da Região Sul (ERAMIA-RS), 2025, Porto Alegre/RS, Brasil. SBC, Porto Alegre, Brasil.

About

Files of...

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors