AkeBoss-tech/clean-your-data

CSV Data Cleaning Agent

A web app for uploading CSV files, describing what you want cleaned in plain English, and getting back the cleaned data plus the Python code that produced it. Useful for data scientists and students.

How to start

First-time setup:

make setup

Then add your Gemini API key to backend/.env:

GEMINI_API_KEY=your_key_here

Get a key from Google AI Studio.

Run the app:

make dev

This starts the Python backend (port 8000) and the Next.js frontend (port 3000). Open http://localhost:3000 in your browser.

To run the backend and frontend separately (e.g. in two terminals):

make start-backend   # backend only
make start-frontend  # frontend only

Or without Make:

# Terminal 1 - backend
cd backend && source venv/bin/activate && python main.py

# Terminal 2 - frontend  
cd frontend && npm run dev

Screenshots

[Screenshots: example of graphing, example of cleaning, example of analysis]


What it does

  • Upload CSVs: One or more files at a time.
  • Chat to clean: Say things like “remove rows with missing values,” “delete the notes column,” or “convert age to numeric.” A multi-agent system (orchestrator, exploration, code execution, documentation) uses Gemini to figure out what you mean and run pandas code.
  • Safe execution: Generated code runs in a sandbox; only pandas/numpy-style operations are allowed.
  • Instant analysis: After upload, you get row/column counts, missing value checks, basic stats, and a preview of the first few rows.
  • Suggestions: The app suggests actions (e.g. drop nulls, fill missing values, remove duplicate columns) based on the data; you can apply them with one click.
  • Download: Get the cleaned CSV and the Python script that produced it.
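
To make the chat examples above concrete, here is a minimal sketch of the kind of pandas code such requests might translate to. It is not the code the agents actually generate; the sample data, the column names, and the `clean` helper are assumptions made up for illustration.

```python
import io

import pandas as pd

# Made-up sample data; the column names are assumptions for illustration.
RAW_CSV = """name,age,notes
Ana,34,ok
Ben,unknown,check
Cara,29,
Dev,,late
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three example requests from the list above, in order."""
    # "delete the notes column"
    df = df.drop(columns=["notes"])
    # "convert age to numeric": values that cannot be parsed become NaN
    df = df.assign(age=pd.to_numeric(df["age"], errors="coerce"))
    # "remove rows with missing values"
    return df.dropna().reset_index(drop=True)


df = pd.read_csv(io.StringIO(RAW_CSV))

# A rough equivalent of the post-upload analysis: counts and missing values.
summary = {
    "rows": len(df),
    "columns": df.shape[1],
    "missing": df.isna().sum().to_dict(),
}

cleaned = clean(df)
print(summary)
print(cleaned)
```

The order of the steps matters in this sketch: dropping the notes column first keeps rows whose only gap was an empty note, which would otherwise be removed along with the rows that have a missing age.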

Architecture

Frontend (Next.js + React)
    |
API Routes (Next.js)
    |
Python Backend (FastAPI)
    |
Multi-Agent System (Gemini)
    |
Data Processing (Pandas)

Tech stack

  • Frontend: Next.js 14, TypeScript, Tailwind, React Dropzone, Papa Parse, Lucide icons.
  • Backend: FastAPI, Pandas, Google Gemini, Uvicorn.
  • Optional: Supabase for conversation history (schema in scripts/supabase_schema.sql).

Project structure

clean-your-data/
├── frontend/           # Next.js app
├── backend/            # FastAPI + agents
│   ├── agents/         # conversational, exploration, code execution, docs, etc.
│   └── main.py
├── scripts/            # setup, start_dev, Supabase schema
├── docs/               # specs, enhancements, ideas, logging, etc.
└── README.md

Setup details

Prerequisites: Python 3.8+, Node.js 18+.

Backend env (backend/.env):

GEMINI_API_KEY=...
SUPABASE_URL=...        # optional
SUPABASE_KEY=...        # optional

Frontend env (frontend/.env.local):

PYTHON_BACKEND_URL=http://localhost:8000
# Supabase vars optional

Supabase (optional): Create a project, run scripts/supabase_schema.sql, then set the env vars. The app works without it; only conversation persistence is skipped.
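
The required/optional split above can be sketched as a small settings loader. This is a hypothetical helper, not the project's actual configuration code: `load_settings` and the returned keys are invented for illustration, and it assumes the variables from backend/.env are already present in the process environment.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read backend settings (hypothetical helper, not the project's code)."""
    env = os.environ if env is None else env

    # The Gemini key is required; fail early with a clear message.
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is missing; add it to backend/.env")

    # Supabase values are optional.
    url = env.get("SUPABASE_URL", "").strip()
    key = env.get("SUPABASE_KEY", "").strip()
    return {
        "gemini_api_key": api_key,
        "supabase_url": url or None,
        "supabase_key": key or None,
        # Persistence is only enabled when both Supabase values are set.
        "persist_conversations": bool(url and key),
    }


settings = load_settings({"GEMINI_API_KEY": "your_key_here"})
print(settings["persist_conversations"])
```

Passing a plain dict instead of reading os.environ directly keeps the helper easy to test without touching real credentials.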


API overview

  • Backend: POST /process (main chat endpoint), GET /health.
  • Frontend proxy: POST /api/process, POST /api/suggestions.
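
For calling the backend endpoints directly, a sketch using only the standard library might look like the following. The request body format is not documented in this README, so the JSON encoding and the `message` field are assumptions; check backend/main.py for the real schema.

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:8000"  # same value as PYTHON_BACKEND_URL


def health_request(base_url: str = BACKEND_URL) -> urllib.request.Request:
    """Build a GET /health request."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def process_request(payload: dict, base_url: str = BACKEND_URL) -> urllib.request.Request:
    """Build a POST /process request with a JSON body (assumed format)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/process",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: the real field names may differ.
example = process_request({"message": "remove rows with missing values"})
print(example.get_method(), example.full_url)

# To actually send a request (requires the backend to be running):
# with urllib.request.urlopen(health_request(), timeout=5) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes it easy to inspect the URL and body before the backend is up.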

Limitations

  • CSV only.
  • Code execution is limited to pandas/numpy.
  • Conversation state is held in memory unless Supabase is configured.
  • Set up for local development; harden before production.
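
One way to picture the pandas/numpy restriction is an import allowlist applied to generated code before it runs. The sketch below is not the project's sandbox. It is a hypothetical pre-check built on the standard `ast` module, and the `ALLOWED_MODULES` set is an assumption derived from the limitation stated above.

```python
import ast

ALLOWED_MODULES = {"pandas", "numpy"}  # assumption based on the stated limit


def disallowed_imports(source: str) -> list:
    """Return the sorted top-level modules imported by `source` that are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are always treated as disallowed.
            names = ["<relative>"] if node.level > 0 else [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in ALLOWED_MODULES:
                found.add(root)
    return sorted(found)


generated = """
import pandas as pd
import os
from numpy import nan
from subprocess import run
"""
print(disallowed_imports(generated))
```

An import check like this is only a first filter. It does not catch calls such as `__import__` or file access through builtins, which is part of why the app should be hardened before production use.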

Troubleshooting

Issue | What to try
--- | ---
“Module not found” | Activate the backend venv and run pip install -r requirements.txt in backend/.
Frontend won’t start | Run npm install in frontend/ and make sure you are on Node 18+.
Backend connection errors | Ensure the backend is running on port 8000 and PYTHON_BACKEND_URL in frontend/.env.local is correct.
Gemini errors | Check GEMINI_API_KEY and your Google AI Studio quota.

Docs

Extra notes and specs live in docs/:

  • docs/specs.md – product/feature specs
  • docs/ENHANCEMENTS.md, docs/FINAL_ENHANCEMENTS.md
  • docs/ideas.md, docs/prompt.md
  • docs/LOGGING.md, docs/frontend.md

License

MIT. See the LICENSE file.
