SQL Copilot

AI-powered SQL query generator using a fine-tuned Llama 3.1, LangChain RAG, and a GPT-based retry mechanism.

Author: Neha Kotwal

Features

  • Natural Language to SQL: Convert plain English questions into executable SQL queries
  • Fine-tuned Llama 3.1: Uses LoRA adapters for efficient fine-tuning on SQL generation tasks
  • RAG-based Context: Retrieves relevant schema information using vector embeddings
  • GPT Retry Mechanism: Achieves ~95% execution success rate through intelligent query fixing
  • Multi-database Support: Works with PostgreSQL, MySQL, and SQLite
  • Interactive Mode: Real-time query generation and testing

Architecture

┌─────────────────┐
│  User Question  │
└────────┬────────┘
         │
         v
┌─────────────────────────┐
│  RAG Context Retriever  │  ← Retrieves relevant schema from vector DB
└────────┬────────────────┘
         │
         v
┌─────────────────────────┐
│  Llama 3.1 Generator    │  ← Fine-tuned with LoRA adapters
│  (via LangChain)        │
└────────┬────────────────┘
         │
         v
┌─────────────────────────┐
│  GPT Retry Mechanism    │  ← Validates and fixes queries
└────────┬────────────────┘
         │
         v
┌─────────────────┐
│   SQL Query     │
└─────────────────┘
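The flow above can be sketched as a simple composition. This is a minimal illustration, not the project's actual API; all function names here are assumptions.

```python
def generate_sql(question, retrieve, generate, validate_and_fix):
    """Chain the three stages: RAG retrieval -> Llama generation -> GPT retry.
    All callables here are illustrative stand-ins for the real components."""
    context = retrieve(question)          # RAG Context Retriever
    draft = generate(question, context)   # Llama 3.1 Generator
    return validate_and_fix(draft)        # GPT Retry Mechanism

# Toy stand-ins for the real components:
sql = generate_sql(
    "How many users are there?",
    retrieve=lambda q: "Table: users\nColumns: id, name",
    generate=lambda q, ctx: "SELECT COUNT(*) FROM users;",
    validate_and_fix=lambda s: s,
)
```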

Configuration

Configuration is provided via environment variables (see .env.example):

# OpenAI API Key (required for retry mechanism)
OPENAI_API_KEY=your_key_here

# Database connection
DB_TYPE=postgresql
DB_HOST=localhost
DB_PORT=5432
DB_NAME=your_database
DB_USER=your_username
DB_PASSWORD=your_password

# Model paths
LLAMA_MODEL_PATH=meta-llama/Meta-Llama-3.1-8B-Instruct
LORA_ADAPTER_PATH=./checkpoints/lora_adapter

# Generation parameters
MAX_RETRIES=3
TEMPERATURE=0.1
MAX_TOKENS=512
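Loaded in Python, the generation settings above might look like the sketch below. This is a minimal stdlib-only illustration; the project's real src/utils/config.py may load them differently (e.g. via python-dotenv).

```python
import os

# Minimal sketch of reading the generation settings above, with the same
# defaults the .env example shows; the real src/utils/config.py may differ.
def load_generation_config():
    return {
        "max_retries": int(os.environ.get("MAX_RETRIES", "3")),
        "temperature": float(os.environ.get("TEMPERATURE", "0.1")),
        "max_tokens": int(os.environ.get("MAX_TOKENS", "512")),
    }

config = load_generation_config()
```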
Usage

  1. Interactive Mode

Start an interactive session:

python main.py --interactive

In interactive mode:

  • Enter natural language questions
  • Type schema to view the database schema
  • Type tables to list all tables
  • Type quit to exit

  2. View Schema

Display the database schema:

python main.py --show-schema

Fine-tuning

Fine-tuning uses the Spider dataset, which contains 10,181 questions and 5,693 unique SQL queries across 200 databases. Download and prepare the Spider dataset:

python scripts/download_spider.py --max-examples 2000
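The conversion performed by the script can be sketched as below. This is a hypothetical illustration; the real scripts/download_spider.py may differ. Spider itself stores the natural-language question under "question" and the gold SQL under "query".

```python
# Hypothetical sketch of the Spider -> training-format conversion; the real
# scripts/download_spider.py may differ in details.
def to_training_record(spider_example, schema_text):
    return {
        "question": spider_example["question"],
        "sql": spider_example["query"],   # Spider's key for the gold SQL
        "schema": schema_text,
    }

record = to_training_record(
    {"question": "How many singers do we have?",
     "query": "SELECT count(*) FROM singer"},
    "Table: singer\nColumns: singer_id, name, country",
)
```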

This downloads the Spider dataset and converts it to the training format, creating training_data/spider_training_data.json with 2000 examples.

Custom Training Data

You can also supply your own examples in the same JSON format:

[
  {
    "question": "Get all users",
    "sql": "SELECT * FROM users;",
    "schema": "Table: users\nColumns: id, name, email, created_at"
  },
  {
    "question": "Find top 5 customers by spending",
    "sql": "SELECT user_id, SUM(total) as spending FROM orders GROUP BY user_id ORDER BY spending DESC LIMIT 5;",
    "schema": "Table: orders\nColumns: id, user_id, total"
  }
]
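A quick sanity check on records in this format might look like the helper below. This is a hypothetical sketch; the project's --validate-only flag presumably performs more thorough checks.

```python
REQUIRED_KEYS = ("question", "sql", "schema")

def check_record(record):
    """Verify a training example has all three required, non-empty string fields."""
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True

ok = check_record({
    "question": "Get all users",
    "sql": "SELECT * FROM users;",
    "schema": "Table: users\nColumns: id, name, email, created_at",
})
```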

Train Model

Fine-tune Llama 3.1 with Spider dataset:

python train.py \
  --data training_data/spider_training_data.json \
  --epochs 3 \
  --batch-size 4 \
  --output ./checkpoints

Validate Training Data

Check data quality before training:

python train.py --data training_data/your_data.json --validate-only

Project Structure

SqlCopilot/
├── src/
│   ├── database/          # Database connection and schema extraction
│   │   ├── db_connector.py
│   │   └── schema_extractor.py
│   ├── rag/              # RAG pipeline for context retrieval
│   │   ├── schema_vectorstore.py
│   │   └── context_retriever.py
│   ├── generator/        # Query generation and retry
│   │   ├── query_generator.py
│   │   └── retry_mechanism.py
│   ├── finetuning/       # Model fine-tuning
│   │   ├── data_processor.py
│   │   └── trainer.py
│   └── utils/            # Utilities and configuration
│       ├── config.py
│       └── logger.py
├── main.py               # Main CLI application
├── train.py              # Training script
├── requirements.txt      # Dependencies
└── .env.example         # Example configuration

Performance

Based on evaluation with 2000+ query pairs:

  • Execution Success Rate: ~95% with retry mechanism
  • Syntactic Validity: ~92% first attempt
  • Context Retrieval: RAG significantly improves accuracy for complex schemas

How It Works

  1. Schema Indexing (RAG)
  • Extracts the database schema: tables, columns, and relationships
  • Converts it to embeddings using sentence-transformers
  2. Query Generation
  • Retrieves relevant schema context based on the user question
  • Passes context + question to the fine-tuned Llama 3.1
  • Uses LangChain for prompt management
  3. Validation & Retry
  • Attempts to execute the generated query
  • If it fails, GPT-4 analyzes the error and fixes the query
  • Retries up to 3 times
  • Achieves a high success rate through iterative refinement
  4. Fine-tuning
  • Uses LoRA (Low-Rank Adaptation) for efficient fine-tuning
  • Instruction-tuning format optimized for Llama 3.1
  • 4-bit quantization reduces memory requirements
  • Can be trained on consumer GPUs
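The validation-and-retry step can be sketched as the loop below. This is an illustrative sketch; in the real src/generator/retry_mechanism.py the fixer would be a GPT-4 call given the failing query and its error message.

```python
def execute_with_retry(query, execute, fix, max_retries=3):
    """Run the query; on failure, hand the query and error to a fixer and retry.
    `execute` and `fix` are injected here so the sketch stays self-contained."""
    for attempt in range(max_retries + 1):
        try:
            return execute(query)
        except Exception as err:
            if attempt == max_retries:
                raise
            # In SQL Copilot the fixer would be a GPT-4 call with the error text.
            query = fix(query, str(err))

# Toy demonstration: the first attempt fails, the "fixer" repairs the typo.
def fake_execute(q):
    if "FORM" in q:
        raise ValueError('syntax error near "FORM"')
    return [("users", 42)]

result = execute_with_retry(
    "SELECT COUNT(*) FORM users;",
    execute=fake_execute,
    fix=lambda q, err: q.replace("FORM", "FROM"),
)
```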
