daniel-st3/fraud-detection-system
Intelligent Fraud Detection & Business Intelligence System

End-to-end fraud detection pipeline built with Python. It ingests transaction data, engineers behavioral features, trains a calibrated model, scores transactions, and serves an interactive Dash dashboard for monitoring and investigation.


What this demonstrates

| Skill Area | Implementation |
| --- | --- |
| Data Engineering | KaggleHub ingestion, normalized SQLite/PostgreSQL schema, staging layer, idempotent pipeline |
| SQL | Star-schema tables, analytical queries, indexing strategy, feature SQL references |
| Feature Engineering | Rolling windows, time-delta features, merchant risk encoding, geo mismatch/new-location signals |
| ML Engineering | RandomForest + probability calibration, chronological split, threshold tuning, artifact versioning |
| MLOps | Scheduled daily runs (APScheduler), reproducible Makefile workflow, persisted metrics and reports |
| Data Validation | Automated quality checks with JSON/Markdown reports and non-zero exit on critical failures |
| BI / Visualization | Dash + Plotly + Mantine UI, DuckDB-backed parquet querying, filter-driven investigation workflows |
| Software Engineering | Pydantic settings, structured logging, pytest suite, modular package layout |

Quickstart (<= 8 commands)

```bash
# 1. Clone / navigate to project
cd fraud-detection-system

# 2. Create virtual environment and install
python3 -m venv .venv && source .venv/bin/activate
make install

# 3. Configure credentials
cp .env.example .env
# Edit .env and set KAGGLE_USERNAME + KAGGLE_KEY

# 4. Run the full pipeline
make pipeline

# 5. Launch dashboard
make dash
# Open http://localhost:8050
```

`make pipeline` runs: `build-db -> ingest -> validate -> features -> train -> score`


Environment Variables

Copy .env.example to .env and populate values as needed.

| Variable | Required | Description |
| --- | --- | --- |
| `KAGGLE_USERNAME` | Yes | Kaggle username |
| `KAGGLE_KEY` | Yes | Kaggle API key |
| `DB_URL` | No | SQLAlchemy URL (default: `sqlite:///data/fraud_detection.db`) |
| `DASHBOARD_PORT` | No | Dashboard port (default: 8050) |
| `FRAUD_THRESHOLD` | No | Probability cutoff override (default: 0.5; auto-selected at train time) |
| `TOP_ALERTS_N` | No | Number of top alerts exported (default: 100) |
| `AMOUNT_MIN` | No | Validation lower bound (default: 0.0) |
| `AMOUNT_MAX` | No | Validation upper bound (default: 1000000.0) |
| `EMAIL_ALERTS_ENABLED` | No | Enable SMTP alerts (true/false) |
| `SMTP_HOST` | No | SMTP host |
| `SMTP_PORT` | No | SMTP port (default: 587) |
| `SMTP_USER` | No | SMTP username / sender |
| `SMTP_PASSWORD` | No | SMTP password |
| `ALERT_EMAIL_TO` | No | Alert recipient |
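The project exposes these through Pydantic settings in `src/config.py`. The idea can be pictured with a stdlib-only sketch (field names mirror the table above; the defaults are the documented ones, but this is not the repo's actual class):

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Simplified stand-in for the project's Pydantic settings class."""
    db_url: str = field(
        default_factory=lambda: os.getenv("DB_URL", "sqlite:///data/fraud_detection.db")
    )
    dashboard_port: int = field(
        default_factory=lambda: int(os.getenv("DASHBOARD_PORT", "8050"))
    )
    fraud_threshold: float = field(
        default_factory=lambda: float(os.getenv("FRAUD_THRESHOLD", "0.5"))
    )
    top_alerts_n: int = field(
        default_factory=lambda: int(os.getenv("TOP_ALERTS_N", "100"))
    )
```

With `pydantic-settings`, the equivalent `BaseSettings` subclass reads the same variables (and a `.env` file) automatically.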

Security

Never commit .env.

The repository ignores .env, database files, and raw data artifacts.

Setting Kaggle credentials

  1. Go to Kaggle -> Settings -> API -> Create New Token.
  2. Read username and key from kaggle.json.
  3. Export them or place them in .env:
```bash
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key
```

KAGGLE_API_TOKEN compatibility note

If KAGGLE_API_TOKEN is set without KAGGLE_USERNAME and KAGGLE_KEY, the app warns and requires the two explicit variables.
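The compatibility rule amounts to the following check (an illustrative sketch; function and message wording are assumptions, not the repo's actual code):

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_kaggle_credentials(env=os.environ):
    """Return (username, key), warning if only KAGGLE_API_TOKEN is present."""
    username = env.get("KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY")
    if env.get("KAGGLE_API_TOKEN") and not (username and key):
        logger.warning(
            "KAGGLE_API_TOKEN is set but not supported on its own; "
            "set KAGGLE_USERNAME and KAGGLE_KEY explicitly."
        )
    if not (username and key):
        raise RuntimeError("KAGGLE_USERNAME and KAGGLE_KEY are required")
    return username, key
```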

Credentials are never printed in logs or written to result artifacts.


Make Targets

```bash
make setup      # First-time setup (install deps + create dirs)
make build-db   # Initialize DB schema (idempotent)
make ingest     # Download/load dataset via KaggleHub
make validate   # Run data quality checks
make features   # Build train/score feature parquet files
make train      # Train calibrated RandomForest model
make score      # Score transactions and export alerts
make dash       # Launch dashboard
make pipeline   # Full end-to-end run
make daily      # Daily scoring pipeline (no retrain)
make scheduler  # Start APScheduler daemon
make test       # Run pytest suite
make test-cov   # Run tests with coverage
make lint       # Ruff lint checks
make clean      # Remove generated artifacts (keep raw data)
make clean-all  # Remove generated artifacts + raw cache
```

Repository Structure

```
fraud-detection-system/
├── data/                         Raw dataset cache (gitignored)
│   └── raw/                      Cached CSVs from Kaggle
├── sql/
│   ├── schema.sql                Normalized DB schema
│   ├── analysis_queries.sql      Analytical SQL queries
│   └── features.sql              Feature engineering SQL references
├── ml/
│   ├── build_features.py         Feature engineering pipeline
│   ├── train_model.py            Model training + calibration + metrics
│   └── score_transactions.py     Scoring + alert exports + dashboard extracts
├── dashboard/
│   ├── app.py                    Dash application
│   ├── data_access.py            DuckDB/parquet data access layer
│   └── assets/                   Custom dashboard styling
├── automation/
│   ├── run_daily.py              Daily orchestration entrypoint
│   └── scheduler.py              APScheduler service
├── documentation/
│   ├── ARCHITECTURE.md           System architecture notes
│   └── METHODOLOGY.md            Modeling and feature methodology
├── results/                      Generated artifacts
│   ├── validation/
│   ├── features/
│   ├── model/
│   └── scoring/
├── src/
│   ├── config.py                 Central settings and logging config
│   ├── ingest/
│   ├── db/
│   └── validation/
├── tests/
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── Makefile
└── README.md
```

How to Reproduce Results

```bash
# Ensure KAGGLE_USERNAME and KAGGLE_KEY are set in .env
make clean
make pipeline
```

Expected artifacts:

  • results/validation/validation_report.{json,md}
  • results/features/features_{train,score}.parquet
  • results/model/model.joblib
  • results/model/metrics.json
  • results/model/model_report.md
  • results/scoring/scored_transactions.parquet
  • results/scoring/top_alerts.csv
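A quick way to confirm a run produced everything is to diff the output directory against this list. The helper below is illustrative and not part of the repository:

```python
from pathlib import Path

# Expected artifact paths, relative to the repo root (from the list above).
EXPECTED = [
    "results/validation/validation_report.json",
    "results/validation/validation_report.md",
    "results/features/features_train.parquet",
    "results/features/features_score.parquet",
    "results/model/model.joblib",
    "results/model/metrics.json",
    "results/model/model_report.md",
    "results/scoring/scored_transactions.parquet",
    "results/scoring/top_alerts.csv",
]


def missing_artifacts(root="."):
    """Return the expected artifacts that are absent under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```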

Reproducibility controls:

  • random_state=42
  • chronological 80/20 split
  • train-only statistics for leakage-sensitive features
  • feature-name alignment embedded in model artifact
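The chronological split can be sketched as follows (simplified: the pipeline operates on the feature parquet files, and the timestamp field name here is an assumption):

```python
def chronological_split(rows, timestamp_key="event_time", train_frac=0.8):
    """Sort by time, then cut once -- no future rows leak into training."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Because the cut is a single point in time rather than a random shuffle, evaluation mimics production: the model is always scored on transactions that occur after everything it was trained on.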

Switching to PostgreSQL

Set `DB_URL` in `.env`:

```
DB_URL=postgresql+psycopg2://user:password@localhost:5432/fraud_db
```

Then run:

```bash
make build-db
```

The ingestion, feature, training, and scoring code all goes through SQLAlchemy, so the pipeline remains database-agnostic.


Troubleshooting Kaggle Auth

| Symptom | Fix |
| --- | --- |
| `kagglehub` download failed | Verify `KAGGLE_USERNAME` and `KAGGLE_KEY` |
| `KAGGLE_API_TOKEN` detected warning | Set `KAGGLE_USERNAME` + `KAGGLE_KEY` |
| 403 Forbidden | Accept the dataset terms on Kaggle |
| No internet access | Check outbound HTTPS access |
| No CSV files found | Remove `data/raw/` and rerun `make ingest` |

Running Tests

```bash
make test
make test-cov

# Individual modules
pytest tests/test_db_schema.py -v
pytest tests/test_validation.py -v
pytest tests/test_features.py -v
pytest tests/test_scoring.py -v
```

Tests run with in-memory SQLite and synthetic fixtures; Kaggle credentials are not required.
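A representative pattern for such a test, assuming a hypothetical `transactions` table (the suite's actual schema and fixtures differ):

```python
import sqlite3


def make_test_db():
    """Build a throwaway in-memory database with one synthetic transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " id INTEGER PRIMARY KEY, amount REAL NOT NULL, is_fraud INTEGER)"
    )
    conn.execute(
        "INSERT INTO transactions (amount, is_fraud) VALUES (?, ?)", (42.0, 0)
    )
    conn.commit()
    return conn


def test_amount_bounds():
    """Synthetic data should pass the AMOUNT_MIN/AMOUNT_MAX validation rule."""
    conn = make_test_db()
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM transactions WHERE amount < 0 OR amount > 1000000"
    ).fetchone()
    assert bad == 0
```

Because the database lives entirely in memory, each test gets a clean schema and nothing touches disk or the network.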


Dashboard Interface

The dashboard runs on a local server (http://localhost:8050) and provides real-time oversight of transaction risks.

It includes three views:

  • Executive Summary & KPI Grid
  • Fraud Monitoring & Score Distribution
  • Drill-Down & Investigation (transaction details)


Daily Automation

```bash
# Manual daily run
make daily

# APScheduler daemon (03:00 UTC)
make scheduler

# Cron alternative
0 3 * * * cd /path/to/fraud-detection-system && .venv/bin/python -m automation.run_daily
```
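Both the cron entry and the APScheduler trigger encode the same "next 03:00 UTC" rule. A stdlib sketch of that computation (the repo itself relies on APScheduler's cron trigger, not this function):

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now, run_at=time(3, 0)):
    """Return the next 03:00 UTC occurrence at or after `now` (a UTC datetime)."""
    candidate = datetime.combine(now.date(), run_at, tzinfo=timezone.utc)
    if candidate < now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate
```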

About

End-to-end fraud detection pipeline and BI dashboard featuring a star-schema SQL database, calibrated RandomForest models, and automated reporting using Python, Scikit-learn, and Plotly Dash.
