End-to-end fraud detection pipeline built with Python. It ingests transaction data, engineers behavioral features, trains a calibrated model, scores transactions, and serves an interactive Dash dashboard for monitoring and investigation.
| Skill Area | Implementation |
|---|---|
| Data Engineering | KaggleHub ingestion, normalized SQLite/PostgreSQL schema, staging layer, idempotent pipeline |
| SQL | Star-schema tables, analytical queries, indexing strategy, feature SQL references |
| Feature Engineering | Rolling windows, time-delta features, merchant risk encoding, geo mismatch/new-location signals |
| ML Engineering | RandomForest + probability calibration, chronological split, threshold tuning, artifact versioning |
| MLOps | Scheduled daily runs (APScheduler), reproducible Makefile workflow, persisted metrics and reports |
| Data Validation | Automated quality checks with JSON/Markdown reports and non-zero exit on critical failures |
| BI / Visualization | Dash + Plotly + Mantine UI, DuckDB-backed parquet querying, filter-driven investigation workflows |
| Software Engineering | Pydantic settings, structured logging, pytest suite, modular package layout |
# 1. Clone / navigate to project
cd fraud-detection-system
# 2. Create virtual environment and install
python3 -m venv .venv && source .venv/bin/activate
make install
# 3. Configure credentials
cp .env.example .env
# Edit .env and set KAGGLE_USERNAME + KAGGLE_KEY
# 4. Run the full pipeline
make pipeline
# 5. Launch dashboard
make dash
# Open http://localhost:8050

make pipeline runs:
build-db -> ingest -> validate -> features -> train -> score
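The stage chain above can be sketched as a simple ordered runner that stops at the first failure. This is an illustration only; the stage bodies are placeholders, and in the real project each stage corresponds to one make target:

```python
# Minimal sketch of the pipeline stage chain. Stage names mirror the
# make targets; the bodies are placeholders, not the project's real code.

STAGES = ["build-db", "ingest", "validate", "features", "train", "score"]

def run_stage(name: str) -> bool:
    """Placeholder for one stage; return False to abort the chain."""
    print(f"running {name}")
    return True

def run_pipeline() -> list:
    """Run stages in dependency order, stopping at the first failure."""
    completed = []
    for stage in STAGES:
        if not run_stage(stage):
            break
        completed.append(stage)
    return completed

completed = run_pipeline()
```

Stopping on the first failed stage keeps downstream artifacts (features, model, scores) from being built on bad inputs.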
Copy .env.example to .env and populate values as needed.
| Variable | Required | Description |
|---|---|---|
| KAGGLE_USERNAME | Yes | Kaggle username |
| KAGGLE_KEY | Yes | Kaggle API key |
| DB_URL | No | SQLAlchemy URL (default: sqlite:///data/fraud_detection.db) |
| DASHBOARD_PORT | No | Dashboard port (default: 8050) |
| FRAUD_THRESHOLD | No | Probability cutoff override (default: 0.5; auto-selected at train time) |
| TOP_ALERTS_N | No | Number of top alerts exported (default: 100) |
| AMOUNT_MIN | No | Validation lower bound (default: 0.0) |
| AMOUNT_MAX | No | Validation upper bound (default: 1000000.0) |
| EMAIL_ALERTS_ENABLED | No | Enable SMTP alerts (true/false) |
| SMTP_HOST | No | SMTP host |
| SMTP_PORT | No | SMTP port (default: 587) |
| SMTP_USER | No | SMTP username / sender |
| SMTP_PASSWORD | No | SMTP password |
| ALERT_EMAIL_TO | No | Alert recipient |
Never commit .env.
The repository ignores .env, database files, and raw data artifacts.
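The optional variables and their defaults can be illustrated with a stdlib-only sketch. The project itself uses Pydantic settings; this snippet only shows how the documented defaults would apply when a variable is unset:

```python
import os

# Stdlib-only sketch of reading the optional variables with their
# documented defaults. The real project loads these via Pydantic settings.

def load_settings(env=None) -> dict:
    env = os.environ if env is None else env
    return {
        "db_url": env.get("DB_URL", "sqlite:///data/fraud_detection.db"),
        "dashboard_port": int(env.get("DASHBOARD_PORT", "8050")),
        "fraud_threshold": float(env.get("FRAUD_THRESHOLD", "0.5")),
        "top_alerts_n": int(env.get("TOP_ALERTS_N", "100")),
    }

defaults = load_settings({})  # empty env -> all documented defaults
```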
- Go to Kaggle -> Settings -> API -> Create New Token.
- Read username and key from kaggle.json.
- Export them or place them in .env:

export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

If KAGGLE_API_TOKEN is set without KAGGLE_USERNAME and KAGGLE_KEY, the app warns and requires the two explicit variables.
Credentials are never printed in logs or written to result artifacts.
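The credential check described above can be sketched as follows; the function name is illustrative, not the app's actual API:

```python
import logging
import os

# Sketch of the credential check: warn when only KAGGLE_API_TOKEN is
# present and require the two explicit variables. Function name is
# illustrative; the real app's implementation may differ.

def check_kaggle_credentials(env=None) -> bool:
    env = os.environ if env is None else env
    has_explicit = "KAGGLE_USERNAME" in env and "KAGGLE_KEY" in env
    if "KAGGLE_API_TOKEN" in env and not has_explicit:
        logging.warning(
            "KAGGLE_API_TOKEN detected; set KAGGLE_USERNAME and KAGGLE_KEY instead"
        )
    return has_explicit

ok = check_kaggle_credentials({"KAGGLE_USERNAME": "u", "KAGGLE_KEY": "k"})
```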
make setup # First-time setup (install deps + create dirs)
make build-db # Initialize DB schema (idempotent)
make ingest # Download/load dataset via KaggleHub
make validate # Run data quality checks
make features # Build train/score feature parquet files
make train # Train calibrated RandomForest model
make score # Score transactions and export alerts
make dash # Launch dashboard
make pipeline # Full end-to-end run
make daily # Daily scoring pipeline (no retrain)
make scheduler # Start APScheduler daemon
make test # Run pytest suite
make test-cov # Run tests with coverage
make lint # Ruff lint checks
make clean # Remove generated artifacts (keep raw data)
make clean-all # Remove generated artifacts + raw cache

fraud-detection-system/
├── data/ Raw dataset cache (gitignored)
│ └── raw/ Cached CSVs from Kaggle
├── sql/
│ ├── schema.sql Normalized DB schema
│ ├── analysis_queries.sql Analytical SQL queries
│ └── features.sql Feature engineering SQL references
├── ml/
│ ├── build_features.py Feature engineering pipeline
│ ├── train_model.py Model training + calibration + metrics
│ └── score_transactions.py Scoring + alert exports + dashboard extracts
├── dashboard/
│ ├── app.py Dash application
│ ├── data_access.py DuckDB/parquet data access layer
│ └── assets/ Custom dashboard styling
├── automation/
│ ├── run_daily.py Daily orchestration entrypoint
│ └── scheduler.py APScheduler service
├── documentation/
│ ├── ARCHITECTURE.md System architecture notes
│ └── METHODOLOGY.md Modeling and feature methodology
├── results/ Generated artifacts
│ ├── validation/
│ ├── features/
│ ├── model/
│ └── scoring/
├── src/
│ ├── config.py Central settings and logging config
│ ├── ingest/
│ ├── db/
│ └── validation/
├── tests/
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── Makefile
└── README.md
# Ensure KAGGLE_USERNAME and KAGGLE_KEY are set in .env
make clean
make pipeline

Expected artifacts:
- results/validation/validation_report.{json,md}
- results/features/features_{train,score}.parquet
- results/model/model.joblib
- results/model/metrics.json
- results/model/model_report.md
- results/scoring/scored_transactions.parquet
- results/scoring/top_alerts.csv
Reproducibility controls:
- random_state=42
- chronological 80/20 split
- train-only statistics for leakage-sensitive features
- feature-name alignment embedded in model artifact
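The chronological split can be illustrated with a small stdlib sketch: sort by timestamp, then cut at 80%, so no future transaction leaks into training. Field names here are illustrative, not the project's schema:

```python
# Sketch of a chronological 80/20 split: order rows by timestamp,
# then cut, so training never sees data from the scoring period.
# Field names are illustrative.

def chronological_split(rows, ts_key="ts", train_frac=0.8):
    ordered = sorted(rows, key=lambda r: r[ts_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

rows = [{"ts": t, "amount": t * 10.0} for t in (5, 1, 4, 2, 3)]
train, test = chronological_split(rows)
# train holds the 4 earliest rows; test holds the latest
```

A random split would mix future and past transactions, inflating offline metrics for features like rolling windows and time deltas; the chronological cut avoids that leakage.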
Set DB_URL in .env:
DB_URL=postgresql+psycopg2://user:password@localhost:5432/fraud_db

Then run:

make build-db

The ingestion, feature, training, and scoring code uses SQLAlchemy and remains DB-agnostic.
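The reason a single DB_URL setting is enough to switch backends is that SQLAlchemy-style URLs encode the dialect and driver in the scheme; only the scheme changes between SQLite and PostgreSQL. A stdlib illustration:

```python
from urllib.parse import urlparse

# Illustration: in a SQLAlchemy-style URL, the backend/driver lives
# entirely in the scheme, so switching databases is a config change.

sqlite_url = "sqlite:///data/fraud_detection.db"
postgres_url = "postgresql+psycopg2://user:password@localhost:5432/fraud_db"

sqlite_scheme = urlparse(sqlite_url).scheme
postgres_scheme = urlparse(postgres_url).scheme
```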
| Symptom | Fix |
|---|---|
| kagglehub download failed | Verify KAGGLE_USERNAME and KAGGLE_KEY |
| KAGGLE_API_TOKEN detected warning | Set KAGGLE_USERNAME + KAGGLE_KEY |
| 403 Forbidden | Accept dataset terms on Kaggle |
| No internet access | Check outbound HTTPS access |
| No CSV files found | Remove data/raw/ and rerun make ingest |
make test
make test-cov
# Individual modules
pytest tests/test_db_schema.py -v
pytest tests/test_validation.py -v
pytest tests/test_features.py -v
pytest tests/test_scoring.py -v

Tests run with in-memory SQLite and synthetic fixtures; Kaggle credentials are not required.
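The in-memory SQLite pattern the suite relies on can be sketched like this; table and column names are illustrative, not the project's actual schema:

```python
import sqlite3

# Sketch of the in-memory SQLite fixture pattern: every test gets a
# throwaway database, so no credentials, files, or cleanup are needed.
# Table/column names are illustrative, not the project's schema.

def make_test_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(id INTEGER PRIMARY KEY, amount REAL, is_fraud INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions (amount, is_fraud) VALUES (?, ?)",
        [(12.5, 0), (999.0, 1)],
    )
    return conn

conn = make_test_db()
count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```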
The dashboard runs on a local server (http://localhost:8050) and provides real-time oversight of transaction risks.
Fraud Monitoring & Score Distribution

# Manual daily run
make daily
# APScheduler daemon (03:00 UTC)
make scheduler
# Cron alternative
0 3 * * * cd /path/to/fraud-detection-system && .venv/bin/python -m automation.run_daily
