daniel-st3/fraud-detection-system
Intelligent Fraud Detection & Business Intelligence System

End-to-end fraud detection pipeline built with Python. It ingests transaction data, engineers behavioral features, trains a calibrated model, scores transactions, and serves an interactive Dash dashboard for monitoring and investigation.


What this demonstrates

| Skill Area | Implementation |
| --- | --- |
| Data Engineering | KaggleHub ingestion, normalized SQLite/PostgreSQL schema, staging layer, idempotent pipeline |
| SQL | Star-schema tables, analytical queries, indexing strategy, feature SQL references |
| Feature Engineering | Rolling windows, time-delta features, merchant risk encoding, geo mismatch/new-location signals |
| ML Engineering | RandomForest + probability calibration, chronological split, threshold tuning, artifact versioning |
| MLOps | Scheduled daily runs (APScheduler), reproducible Makefile workflow, persisted metrics and reports |
| Data Validation | Automated quality checks with JSON/Markdown reports and non-zero exit on critical failures |
| BI / Visualization | Dash + Plotly + Mantine UI, DuckDB-backed parquet querying, filter-driven investigation workflows |
| Software Engineering | Pydantic settings, structured logging, pytest suite, modular package layout |

Quickstart (<= 8 commands)

```bash
# 1. Clone / navigate to project
cd fraud-detection-system

# 2. Create virtual environment and install
python3 -m venv .venv && source .venv/bin/activate
make install

# 3. Configure credentials
cp .env.example .env
# Edit .env and set KAGGLE_USERNAME + KAGGLE_KEY

# 4. Run the full pipeline
make pipeline

# 5. Launch dashboard
make dash
# Open http://localhost:8050
```

`make pipeline` runs: `build-db -> ingest -> validate -> features -> train -> score`


Environment Variables

Copy .env.example to .env and populate values as needed.

| Variable | Required | Description |
| --- | --- | --- |
| `KAGGLE_USERNAME` | Yes | Kaggle username |
| `KAGGLE_KEY` | Yes | Kaggle API key |
| `DB_URL` | No | SQLAlchemy URL (default: `sqlite:///data/fraud_detection.db`) |
| `DASHBOARD_PORT` | No | Dashboard port (default: 8050) |
| `FRAUD_THRESHOLD` | No | Probability cutoff override (default: 0.5; auto-selected at train time) |
| `TOP_ALERTS_N` | No | Number of top alerts exported (default: 100) |
| `AMOUNT_MIN` | No | Validation lower bound (default: 0.0) |
| `AMOUNT_MAX` | No | Validation upper bound (default: 1000000.0) |
| `EMAIL_ALERTS_ENABLED` | No | Enable SMTP alerts (true/false) |
| `SMTP_HOST` | No | SMTP host |
| `SMTP_PORT` | No | SMTP port (default: 587) |
| `SMTP_USER` | No | SMTP username / sender |
| `SMTP_PASSWORD` | No | SMTP password |
| `ALERT_EMAIL_TO` | No | Alert recipient |
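The project exposes these through Pydantic settings in `src/config.py`. The idea can be pictured with a stdlib-only sketch (field names mirror the table above; the defaults are the documented ones, but this is not the repo's actual class):

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Simplified stand-in for the project's Pydantic settings class."""
    db_url: str = field(
        default_factory=lambda: os.getenv("DB_URL", "sqlite:///data/fraud_detection.db")
    )
    dashboard_port: int = field(
        default_factory=lambda: int(os.getenv("DASHBOARD_PORT", "8050"))
    )
    fraud_threshold: float = field(
        default_factory=lambda: float(os.getenv("FRAUD_THRESHOLD", "0.5"))
    )
    top_alerts_n: int = field(
        default_factory=lambda: int(os.getenv("TOP_ALERTS_N", "100"))
    )
```

With `pydantic-settings`, the equivalent `BaseSettings` subclass reads the same variables (and a `.env` file) automatically.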

Security

Never commit .env.

The repository ignores .env, database files, and raw data artifacts.

Setting Kaggle credentials

  1. Go to Kaggle -> Settings -> API -> Create New Token.
  2. Read username and key from kaggle.json.
  3. Export them or place them in .env:
```bash
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key
```

KAGGLE_API_TOKEN compatibility note

If KAGGLE_API_TOKEN is set without KAGGLE_USERNAME and KAGGLE_KEY, the app warns and requires the two explicit variables.
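The compatibility rule amounts to the following check (an illustrative sketch; function and message wording are assumptions, not the repo's actual code):

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_kaggle_credentials(env=os.environ):
    """Return (username, key), warning if only KAGGLE_API_TOKEN is present."""
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if env.get("KAGGLE_API_TOKEN") and not (username and key):
        logger.warning(
            "KAGGLE_API_TOKEN is set but not supported on its own; "
            "set KAGGLE_USERNAME and KAGGLE_KEY explicitly."
        )
    if not (username and key):
        raise RuntimeError("KAGGLE_USERNAME and KAGGLE_KEY are required")
    return username, key
```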

Credentials are never printed in logs or written to result artifacts.


Make Targets

```bash
make setup      # First-time setup (install deps + create dirs)
make build-db   # Initialize DB schema (idempotent)
make ingest     # Download/load dataset via KaggleHub
make validate   # Run data quality checks
make features   # Build train/score feature parquet files
make train      # Train calibrated RandomForest model
make score      # Score transactions and export alerts
make dash       # Launch dashboard
make pipeline   # Full end-to-end run
make daily      # Daily scoring pipeline (no retrain)
make scheduler  # Start APScheduler daemon
make test       # Run pytest suite
make test-cov   # Run tests with coverage
make lint       # Ruff lint checks
make clean      # Remove generated artifacts (keep raw data)
make clean-all  # Remove generated artifacts + raw cache
```

Repository Structure

```
fraud-detection-system/
├── data/                         Raw dataset cache (gitignored)
│   └── raw/                      Cached CSVs from Kaggle
├── sql/
│   ├── schema.sql                Normalized DB schema
│   ├── analysis_queries.sql      Analytical SQL queries
│   └── features.sql              Feature engineering SQL references
├── ml/
│   ├── build_features.py         Feature engineering pipeline
│   ├── train_model.py            Model training + calibration + metrics
│   └── score_transactions.py     Scoring + alert exports + dashboard extracts
├── dashboard/
│   ├── app.py                    Dash application
│   ├── data_access.py            DuckDB/parquet data access layer
│   └── assets/                   Custom dashboard styling
├── automation/
│   ├── run_daily.py              Daily orchestration entrypoint
│   └── scheduler.py              APScheduler service
├── documentation/
│   ├── ARCHITECTURE.md           System architecture notes
│   └── METHODOLOGY.md            Modeling and feature methodology
├── results/                      Generated artifacts
│   ├── validation/
│   ├── features/
│   ├── model/
│   └── scoring/
├── src/
│   ├── config.py                 Central settings and logging config
│   ├── ingest/
│   ├── db/
│   └── validation/
├── tests/
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── Makefile
└── README.md
```

How to Reproduce Results

```bash
# Ensure KAGGLE_USERNAME and KAGGLE_KEY are set in .env
make clean
make pipeline
```

Expected artifacts:

  • results/validation/validation_report.{json,md}
  • results/features/features_{train,score}.parquet
  • results/model/model.joblib
  • results/model/metrics.json
  • results/model/model_report.md
  • results/scoring/scored_transactions.parquet
  • results/scoring/top_alerts.csv
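A quick way to confirm a run produced everything is to diff the output directory against this list. The helper below is illustrative and not part of the repository:

```python
from pathlib import Path

# Expected artifact paths, relative to the repo root (from the list above).
EXPECTED = [
    "results/validation/validation_report.json",
    "results/validation/validation_report.md",
    "results/features/features_train.parquet",
    "results/features/features_score.parquet",
    "results/model/model.joblib",
    "results/model/metrics.json",
    "results/model/model_report.md",
    "results/scoring/scored_transactions.parquet",
    "results/scoring/top_alerts.csv",
]


def missing_artifacts(root="."):
    """Return the expected artifacts that are absent under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```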

Reproducibility controls:

  • random_state=42
  • chronological 80/20 split
  • train-only statistics for leakage-sensitive features
  • feature-name alignment embedded in model artifact
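The chronological split can be sketched as follows (simplified: the pipeline operates on the feature parquet files, and the timestamp field name here is an assumption):

```python
def chronological_split(rows, timestamp_key="event_time", train_frac=0.8):
    """Sort by time, then cut once -- no future rows leak into training."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Because the cut is a single point in time rather than a random shuffle, evaluation mimics production: the model is always scored on transactions that occur after everything it was trained on.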

Switching to PostgreSQL

Set `DB_URL` in `.env`:

```
DB_URL=postgresql+psycopg2://user:password@localhost:5432/fraud_db
```

Then run:

```bash
make build-db
```

The ingestion, feature, training, and scoring code all goes through SQLAlchemy, so the pipeline remains database-agnostic.


Troubleshooting Kaggle Auth

| Symptom | Fix |
| --- | --- |
| `kagglehub` download failed | Verify `KAGGLE_USERNAME` and `KAGGLE_KEY` |
| `KAGGLE_API_TOKEN` detected warning | Set `KAGGLE_USERNAME` + `KAGGLE_KEY` |
| 403 Forbidden | Accept the dataset terms on Kaggle |
| No internet access | Check outbound HTTPS access |
| No CSV files found | Remove `data/raw/` and rerun `make ingest` |

Running Tests

```bash
make test
make test-cov

# Individual modules
pytest tests/test_db_schema.py -v
pytest tests/test_validation.py -v
pytest tests/test_features.py -v
pytest tests/test_scoring.py -v
```

Tests run with in-memory SQLite and synthetic fixtures; Kaggle credentials are not required.
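A representative pattern for such a test, assuming a hypothetical `transactions` table (the suite's actual schema and fixtures differ):

```python
import sqlite3


def make_test_db():
    """Build a throwaway in-memory database with one synthetic transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " id INTEGER PRIMARY KEY, amount REAL NOT NULL, is_fraud INTEGER)"
    )
    conn.execute(
        "INSERT INTO transactions (amount, is_fraud) VALUES (?, ?)", (42.0, 0)
    )
    conn.commit()
    return conn


def test_amount_bounds():
    """Synthetic data should pass the AMOUNT_MIN/AMOUNT_MAX validation rule."""
    conn = make_test_db()
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM transactions WHERE amount < 0 OR amount > 1000000"
    ).fetchone()
    assert bad == 0
```

Because the database lives entirely in memory, each test gets a clean schema and nothing touches disk or the network.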


Dashboard Interface

The dashboard runs on a local server (http://localhost:8050) and provides real-time oversight of transaction risks.

It includes three views:

  • Executive Summary & KPI Grid
  • Fraud Monitoring & Score Distribution
  • Drill-Down & Investigation (transaction details)


Daily Automation

```bash
# Manual daily run
make daily

# APScheduler daemon (03:00 UTC)
make scheduler

# Cron alternative
0 3 * * * cd /path/to/fraud-detection-system && .venv/bin/python -m automation.run_daily
```
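Both the cron entry and the APScheduler trigger encode the same "next 03:00 UTC" rule. A stdlib sketch of that computation (the repo itself relies on APScheduler's cron trigger, not this function):

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now, run_at=time(3, 0)):
    """Return the next 03:00 UTC occurrence at or after `now` (a UTC datetime)."""
    candidate = datetime.combine(now.date(), run_at, tzinfo=timezone.utc)
    if candidate < now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate
```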

About

End-to-end fraud detection pipeline and BI dashboard featuring a star-schema SQL database, calibrated RandomForest models, and automated reporting using Python, Scikit-learn, and Plotly Dash.
