(To run the project without the Node.js server, refer to the README inside the project folder.)
A machine learning-powered phishing detection system that identifies malicious URLs using advanced feature extraction and classification. Built with a Python FastAPI ML service and Node.js Express backend.
- Overview
- Features
- Architecture
- Prerequisites
- Installation & Setup
- Running the Project
- API Endpoints
- Project Structure
- Dataset
- Model Training
- Troubleshooting
This project detects whether a given URL is benign, suspicious, or malicious using machine learning classification. The system:
- Extracts 30+ features from URLs without making network requests
- Uses XGBoost classifier for high-accuracy predictions
- Provides REST API endpoints for easy integration
- Returns confidence scores and detailed predictions
Risk Score Categories:
- 0-30%: Benign (safe)
- 30-60%: Suspicious (caution advised)
- 60-100%: Malicious (likely phishing)
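The bands above can be sketched as a small mapping function. This is a minimal illustration, not the project's actual code: the name `risk_label` is hypothetical, and because 30% and 60% each appear in two bands, the sketch assumes a boundary value falls into the higher-risk band.

```python
def risk_label(risk_score: float) -> str:
    """Map a 0-100 risk score to a verdict label.

    Assumption: boundary values (30 and 60) belong to the higher-risk band.
    """
    if not 0.0 <= risk_score <= 100.0:
        raise ValueError(f"risk_score must be within 0-100, got {risk_score}")
    if risk_score < 30.0:
        return "benign"
    if risk_score < 60.0:
        return "suspicious"
    return "malicious"
```

If the service treats the boundaries differently, only the two comparisons need to change.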
✅ URL-based feature extraction - No external API calls or network requests
✅ Fast inference - Pre-loaded model, response in milliseconds
✅ Detailed predictions - Risk scores, labels, and extracted features
✅ Production-ready - FastAPI with health checks and error handling
✅ Scalable architecture - Separated ML service and API backend
✅ Easy deployment - Docker-compatible, minimal dependencies
┌─────────────────────────────────────────────────────────┐
│           Node.js Express Backend (Port 3000)           │
│                      (API Gateway)                      │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ HTTP Request
                     │ POST /predict-url
                     ▼
┌─────────────────────────────────────────────────────────┐
│          Python FastAPI ML Service (Port 8000)          │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Feature Extraction (30+ URL features)            │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │  XGBoost Model (Pre-loaded & Cached)              │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Risk Assessment & Response Generation            │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
Before starting, ensure you have installed Python (3.8+), Node.js, and npm.
To verify installations:
python --version
node --version
npm --version
cd /path/to/project
Navigate to the ml-service directory and create a virtual environment:
cd ml-service
python -m venv venv
Activate the virtual environment:
On Linux/macOS:
source venv/bin/activate
On Windows:
venv\Scripts\activate
pip install -r ../requirements.txt
Key Python packages installed:
- fastapi - Web framework for ML service
- uvicorn - ASGI server
- xgboost - ML classification model
- pandas - Data processing
- numpy - Numerical computing
- scikit-learn - Machine learning utilities
- joblib - Model serialization
In a new terminal, navigate to the backend-node directory:
cd backend-node
npm install
This installs:
- express - Web framework
- axios - HTTP client for ML service communication
- cors - Cross-origin resource sharing middleware
If you don't have a pre-trained model (url_model.pkl), train it first:
cd ml-service
python train.py
This will:
- Load the dataset from datasets/url/phishing.csv
- Extract URL features
- Train the XGBoost classifier
- Save the model to url_model.pkl
- Display training metrics and cross-validation scores
Expected output:
Dataset: 11000 URLs | Malicious: 5200 | Benign: 5800
Extracting features (URL-only, no network calls)...
0 / 11000
10000 / 11000
Done. Failed extractions: 2
Training XGBoost model with StratifiedKFold...
[Results and metrics displayed]
Model saved to url_model.pkl
In the ml-service directory (with venv activated), run:
uvicorn main:app --reload --port 8000
Expected output:
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started server process [12345]
INFO: Waiting for connections...
The FastAPI service is now running. You can access:
- API: http://localhost:8000
- Interactive Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
In the backend-node directory, run:
node index.js
Expected output:
Node running on 3000
Endpoint: POST http://localhost:3000/predict-url
Request Body:
{
"url": "https://www.example.com/login"
}
Response:
{
"url": "https://www.example.com/login",
"risk_score": 15.3,
"label": "benign",
"features": {
"url_length": 35,
"domain_length": 15,
"has_https": 1,
"num_dots": 2,
...
}
}
Response Fields:
- url - The analyzed URL
- risk_score - Confidence percentage (0-100)
- label - Verdict: "benign", "suspicious", or "malicious"
- features - All extracted features (can be removed for production)
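For callers that prefer Python over curl, the request and response described above could be exercised with a standard-library client along these lines. This is a sketch only: the helper names `build_predict_request` and `predict_url` are hypothetical, and it assumes the gateway is reachable at the documented address.

```python
import json
import urllib.request

# Endpoint documented above; adjust if the gateway runs elsewhere.
GATEWAY_URL = "http://localhost:3000/predict-url"


def build_predict_request(url_to_check: str,
                          endpoint: str = GATEWAY_URL) -> urllib.request.Request:
    """Build the POST request carrying {"url": ...} as a JSON body."""
    body = json.dumps({"url": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_url(url_to_check: str,
                endpoint: str = GATEWAY_URL,
                timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response into a dict."""
    request = build_predict_request(url_to_check, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires both services to be running):
#   result = predict_url("https://www.example.com/login")
#   print(result["label"], result["risk_score"])
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.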
Endpoint: GET http://localhost:8000/health
Response:
{
"status": "ok",
"features_loaded": 30
}
project/
├── README.md # This file
├── requirements.txt # Python dependencies
│
├── ml-service/ # Python ML Service (FastAPI)
│ ├── venv/ # Virtual environment (created during setup)
│ ├── __pycache__/ # Compiled Python files
│ ├── main.py # FastAPI application & endpoints
│ ├── train.py # Model training script
│ ├── features.py # URL feature extraction logic
│ └── url_model.pkl # Pre-trained XGBoost model
│
├── backend-node/ # Node.js Express API Gateway
│ ├── node_modules/ # NPM packages (created during setup)
│ ├── package.json # Node.js dependencies
│ ├── package-lock.json # Dependency versions lock file
│ └── index.js # Express server & routing
│
└── datasets/ # Training data
    ├── apk/ # APK analysis datasets (not used)
    └── url/
        └── phishing.csv # Phishing URL dataset (~11k URLs)
Location: datasets/url/phishing.csv
Format:
| Column | Type | Description |
|---|---|---|
| URL | String | The target URL to classify |
| label | Integer | 0 = Benign, 1 = Malicious/Phishing |
Statistics:
- Total URLs: ~11,000
- Benign URLs: ~5,800
- Malicious URLs: ~5,200
- Balance: Approximately balanced dataset
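To confirm the class balance on a local copy of the dataset, a quick count using only the standard library might look like the following. It is a sketch that assumes the file has a header row with the `URL` and `label` columns described above; `label_counts` is a hypothetical helper name.

```python
import csv
from collections import Counter


def label_counts(csv_file) -> Counter:
    """Count rows per label (0 = benign, 1 = malicious) in an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[int(row["label"])] += 1
    return counts


# Usage, run from the project root:
#   with open("datasets/url/phishing.csv", newline="", encoding="utf-8") as f:
#       print(label_counts(f))
```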
The system extracts 30+ features from each URL:
Structural Features:
- URL, domain, path, and query lengths
- Character counts (dots, hyphens, underscores, slashes, etc.)
- Number of parameters and subdomains
Protocol Features:
- HTTPS presence
- IP-based domains
- Port information
- Special characters (@ sign, double slash, hex encoding, etc.)
Content Features:
- Suspicious TLDs (.tk, .ml, .ga, etc.)
- URL shorteners (bit.ly, tinyurl.com, etc.)
- Suspicious keywords (login, verify, bank, secure, etc.)
Ratio Features:
- Digit ratio in domain
- Special character ratio in URL
- Domain to URL length ratio
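As an illustration of how such features can be derived from the URL string alone, here is a small sketch covering a handful of them. It is not the contents of features.py: only `url_length`, `domain_length`, `has_https`, and `num_dots` appear in the sample response, so every other key, the keyword list, and the TLD set are assumptions chosen for the example.

```python
from urllib.parse import urlsplit

# Illustrative subsets; the real lists in features.py may differ.
SUSPICIOUS_TLDS = (".tk", ".ml", ".ga")
SUSPICIOUS_KEYWORDS = ("login", "verify", "bank", "secure")


def extract_features(url: str) -> dict:
    """Compute a few URL-only features without any network request."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    lowered = url.lower()
    digits_in_host = sum(ch.isdigit() for ch in host)
    return {
        # Structural features
        "url_length": len(url),
        "domain_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        # Protocol features
        "has_https": int(parts.scheme.lower() == "https"),
        "has_at_sign": int("@" in url),
        # Content features
        "suspicious_tld": int(host.endswith(SUSPICIOUS_TLDS)),
        "num_suspicious_keywords": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
        # Ratio features (guarded against empty input)
        "digit_ratio_domain": digits_in_host / len(host) if host else 0.0,
        "domain_to_url_ratio": len(host) / len(url) if url else 0.0,
    }
```

Because everything is computed from the parsed string, extraction stays fast and works even for URLs whose hosts are offline.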
The train.py script handles the entire training pipeline:
- Load Data - Reads phishing.csv and cleans/normalizes
- Feature Extraction - Extracts 30+ features per URL
- Model Training - Trains XGBoost classifier with StratifiedKFold cross-validation
- Evaluation - Generates classification reports and confusion matrices
- Save Model - Persists model to url_model.pkl for production use
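The stratified cross-validation step in the list above keeps the benign/malicious ratio roughly equal in every fold. train.py presumably relies on scikit-learn's StratifiedKFold for this; the pure-Python sketch below is not that implementation, only an illustration of the idea, and `stratified_folds` is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to a fold so every fold keeps the label ratio.

    Returns a list where element i is the fold number of sample i.
    """
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in indices_by_label.values():
        rng.shuffle(indices)  # randomize order within each class
        for position, index in enumerate(indices):
            # Deal samples of this class round-robin across the folds.
            fold_of[index] = position % n_splits
    return fold_of
```

Each fold then serves once as the validation set while the remaining folds are used for training.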
cd ml-service
python train.py
- Accuracy: 95-97%
- Precision: 94-96%
- Recall: 95-98%
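For reference, accuracy, precision, and recall can be computed from true and predicted labels as sketched below, treating label 1 (malicious) as the positive class. This is a plain-Python illustration with a hypothetical function name; the training script may well use scikit-learn's reporting utilities instead.

```python
def classification_metrics(y_true, y_pred) -> dict:
    """Compute accuracy, precision, and recall with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For a phishing detector, recall reflects how many malicious URLs are caught, while precision reflects how often a "malicious" verdict is correct.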
Solution: Ensure virtual environment is activated and dependencies installed
source venv/bin/activate # Linux/macOS
pip install -r ../requirements.txt # run from ml-service; requirements.txt is in the project root
Solution: Ensure FastAPI service is running on port 8000
# In ml-service directory, check if running:
lsof -i :8000 # Show processes on port 8000
Solution: Kill the process using the port or use a different port
# Kill process on port 8000
lsof -ti :8000 | xargs kill -9
# Or change port in index.js or main.py
Solution: Train the model first
cd ml-service
python train.py
Solution: Use absolute path or ensure Python 3.8+ is installed
/path/to/project/ml-service/venv/bin/python train.py
For fastest setup, run these commands in order:
# 1. Setup Python environment
cd ml-service
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r ../requirements.txt
# 2. Train model (if needed)
python train.py
# 3. Test model
python test_model.py
# (custom test cases can be added inside test_model.py)
# 4. Start ML service (Terminal 1)
uvicorn main:app --reload --port 8000
# 5. Setup Node.js (Terminal 2)
cd ../backend-node
npm install
node index.js
# 6. Test API (Terminal 3)
curl -X POST http://localhost:3000/predict-url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
ISC License
For issues or questions, ensure:
- ✅ Both services are running (FastAPI on 8000, Express on 3000)
- ✅ Virtual environment is activated
- ✅ All dependencies are installed
- ✅ Model file (url_model.pkl) exists
- ✅ Dataset exists at datasets/url/phishing.csv
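Most of the checklist above can be automated with a short preflight script. The following is a sketch under stated assumptions: the function names are hypothetical, both services are assumed to listen on localhost, and the file locations follow the project structure shown earlier. It does not check the virtual environment or installed packages.

```python
import socket
from pathlib import Path


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def file_checks(project_root: Path) -> dict:
    """Check that the model and dataset files exist under the project root."""
    return {
        "model_file": (project_root / "ml-service" / "url_model.pkl").is_file(),
        "dataset_file": (project_root / "datasets" / "url" / "phishing.csv").is_file(),
    }


def preflight(project_root: Path) -> dict:
    """Combine port and file checks into one report."""
    report = {
        "fastapi_on_8000": port_is_open("127.0.0.1", 8000),
        "express_on_3000": port_is_open("127.0.0.1", 3000),
    }
    report.update(file_checks(project_root))
    return report


# Usage, run from the project root:
#   for name, ok in preflight(Path(".")).items():
#       print(("OK  " if ok else "FAIL"), name)
```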
Last Updated: April 2026