This repository contains the artifacts of the paper "A Conversational Agent for Natural Language Access to Public Health Data", submitted to CBMS 2026. To the best of our knowledge, we present the first Text-to-SQL system for the Hospital Information System in Reduced Data format (SIH-RD) from DATASUS, enabling queries over 18.7 million hospitalization records from the states of Rio Grande do Sul and Maranhão between 2008 and 2023. To address this, we derive fifteen domain-specific SQL generation rules from systematic SIH-RD schema analysis and embed them in a 9-stage LangGraph pipeline with query routing, automatic table selection, chain-of-thought planning, SQL generation, static validation, and bounded self-repair, requiring no model fine-tuning. We also construct a benchmark of 120 Portuguese healthcare queries stratified into Easy, Medium, and Hard tiers (40 each) with gold-standard SQL over records from the two Brazilian states (2008--2023), comprising the first formal Text-to-SQL evaluation on SIH-RD. The agent achieves 93.3% execution accuracy (112/120) with 100% pipeline completion; a controlled single-shot baseline sharing identical model, domain rules, and prompts achieves 90.0% (108/120), with the advantage concentrated in Hard queries (+10.0~pp), isolating the contribution of graph orchestration for complex multi-table reasoning.
Data was collected from SIH-RD via the PySUS library, in the Reduced Data format (RD) — the contracted version of DATASUS that enables processing on commodity local hardware. The data consists of public microdata containing no personally identifiable information (PII), ensuring compliance with the Brazilian General Data Protection Law (LGPD).
The database is modeled as a Snowflake Schema, optimized for OLAP analytical workloads, comprising:
| Prefix | Type | DATASUS Name | FHIR Name |
|---|---|---|---|
TF_ |
Fact | internacoes |
Encounter |
RL_ |
Bridge | atendimentos |
Procedure |
TD_ |
Dimension | hospital |
Organization |
TD_ |
Dimension | municipios |
Location |
TD_ |
Dimension | cid |
ICDCode |
TD_ |
Dimension | procedimentos |
ProcedureCode |
TD_ |
Dimension | especialidade |
SpecialtyCode |
TD_ |
Dimension | sexo |
AdministrativeSex |
TD_ |
Dimension | raca_cor |
RaceCode |
TD_ |
Dimension | etnia |
EthnicityCode |
TD_ |
Dimension | nacionalidade |
NationalityCode |
TD_ |
Dimension | instrucao |
EducationCode |
TD_ |
Dimension | vincprev |
SocialSecurityCode |
TD_ |
Dimension | contraceptivos |
ContraceptiveCode |
TD_ |
Dimension | tempo |
TimeDimension |
TD_ |
Analytical | socioeconomico |
SocioeconomicIndicator |
The benchmark comprises 120 Portuguese-language queries with gold-standard SQL, stratified into three difficulty tiers:
| Tier | N | Structural Criteria |
|---|---|---|
| Easy | 40 | Single table; basic filter or simple aggregation; no joins |
| Medium | 40 | Two-table FK join, or single-table temporal/grouped aggregation |
| Hard | 40 | ≥2 joins, multi-step reasoning, or computed metrics across dimensional hierarchies |
Queries span four analytical dimensions: aggregation, temporal filtering, geographic analysis, and multi-table joins.
The ground_truth.json file contains for each query: id, difficulty, question, query (gold-standard SQL), tables, and notes with relevant modeling decisions.
The agent is implemented in LangGraph across 9 sequential stages:
- Query Classification — routing: DATABASE, conversational, schema, or ambiguous
- Table Selection — identification of relevant tables via regex fast-path or LLM
- Schema Retrieval — fetches columns, types, FKs, and value ranges exclusively for selected tables
- CoT Planning — chain-of-thought planning for complex queries (per-group top-N, global vs. local averages, temporal comparisons)
- SQL Generation — synthesis using RULES A–O, SUS-specific value mappings, and per-table few-shot examples
- SQL Validation — static syntax and schema alignment check
- SQL Execution — query execution against DuckDB
- Self-Repair — error recovery: schema-level (rerouting to table selection) or syntax-level (targeted regeneration), up to 2 retries
- Response Generation — natural-language formatting; guara
| System | Easy (n=40) | Medium (n=40) | Hard (n=40) | Overall (n=120) |
|---|---|---|---|---|
| Prompt Baseline | 97.5% | 100.0% | 72.5% | 90.0% |
| LangGraph Agent | 100.0% | 97.5% | 82.5% | 93.3% |
| Δ (Agent − Baseline) | +2.5 pp | −2.5 pp | +10.0 pp | +3.3 pp |
McNemar's exact test: b=6, c=2, p=0.289. Wilson 95% CI: agent [87.4%, 96.6%], baseline [83.3%, 94.2%].
The agent and data pipeline source code will be made publicly available in this repository upon paper acceptance.
Artifacts currently available:
ground_truth.json— full benchmark with 120 gold-standard queries- evaluation reports
- Data dictionary and database schema
We are members of the Data Visualization and Interaction Lab (DaVInt) at PUCRS:
- Isabel H. Manssour -- Professor Coordinator of DaVInt -- 2017-current.
- Isadora Ferraz e Figueiredo -- Undergraduate Research Student -- 2025-current.
- Maicon Kevyn Moraes da Silva -- Undergraduate Research Student -- 2025
- Victória Cavalheiro Marques -- Undergraduate Research Student -- 2025-current.
