Android-Malware-Detection

Phishing URL Detector

(To run the project without the Node server, refer to the README inside the project folder.)

A machine learning-powered phishing detection system that identifies malicious URLs using advanced feature extraction and classification. Built with a Python FastAPI ML service and Node.js Express backend.

Overview

This project detects whether a given URL is benign, suspicious, or malicious using machine learning classification. The system:

Extracts 30+ features from URLs without making network requests
Uses XGBoost classifier for high-accuracy predictions
Provides REST API endpoints for easy integration
Returns confidence scores and detailed predictions

Risk Score Categories:

0-30%: Benign (safe)
30-60%: Suspicious (caution advised)
60-100%: Malicious (likely phishing)
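
The bands above can be expressed as a small lookup. This is a minimal sketch rather than the project's code: the function name `label_for_score` is hypothetical, and because the listed ranges share their endpoints (30 and 60), the sketch assumes a boundary value falls into the higher-risk band.

```python
def label_for_score(risk_score: float) -> str:
    """Map a 0-100 risk score to a label using the bands listed above.

    Assumption: boundary values (30 and 60) go to the higher-risk band.
    """
    if not 0 <= risk_score <= 100:
        raise ValueError("risk_score must be between 0 and 100")
    if risk_score < 30:
        return "benign"
    if risk_score < 60:
        return "suspicious"
    return "malicious"


for score in (15.3, 30, 72.5):
    print(score, "->", label_for_score(score))
```

If the service treats the boundaries differently, only the two comparisons need to change.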

Features

✅ URL-based feature extraction - No external API calls or network requests
✅ Fast inference - Pre-loaded model, response in milliseconds
✅ Detailed predictions - Risk scores, labels, and extracted features
✅ Production-ready - FastAPI with health checks and error handling
✅ Scalable architecture - Separated ML service and API backend
✅ Easy deployment - Docker-compatible, minimal dependencies

Architecture

┌─────────────────────────────────────────────────────────┐
│          Node.js Express Backend (Port 3000)            │
│                  (API Gateway)                          │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ HTTP Request
                     │ POST /predict-url
                     ▼
┌─────────────────────────────────────────────────────────┐
│      Python FastAPI ML Service (Port 8000)              │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Feature Extraction (30+ URL features)             │  │
│  └───────────────────────────────────────────────────┘  │ 
│  ┌───────────────────────────────────────────────────┐  │
│  │ XGBoost Model (Pre-loaded & Cached)               │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Risk Assessment & Response Generation             │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Prerequisites

Before starting, ensure you have installed:

Python 3.8+
Node.js 14+
npm (comes with Node.js)

To verify installations:

python --version
node --version
npm --version

Installation & Setup

Step 1: Clone/Navigate to Project Directory

cd /path/to/project

Step 2: Set Up Python Virtual Environment

Navigate to the ml-service directory and create a virtual environment:

cd ml-service
python -m venv venv

Activate the virtual environment:

On Linux/macOS:

source venv/bin/activate

On Windows:

venv\Scripts\activate

Step 3: Install Python Dependencies

pip install -r ../requirements.txt

Key Python packages installed:

fastapi - Web framework for ML service
uvicorn - ASGI server
xgboost - ML classification model
pandas - Data processing
numpy - Numerical computing
scikit-learn - Machine learning utilities
joblib - Model serialization

Step 4: Install Node.js Dependencies

In a new terminal, navigate to the backend-node directory:

cd backend-node
npm install

This installs:

express - Web framework
axios - HTTP client for ML service communication
cors - Cross-origin resource sharing middleware

Running the Project

Step 1: Train the Model (Optional - One-time Setup)

If you don't have a pre-trained model (url_model.pkl), train it first:

cd ml-service
python train.py

This will:

Load the dataset from datasets/url/phishing.csv
Extract URL features
Train the XGBoost classifier
Save the model to url_model.pkl
Display training metrics and cross-validation scores

Expected output:

Dataset: 11000 URLs  |  Malicious: 5200  |  Benign: 5800
Extracting features (URL-only, no network calls)...
  0 / 11000
  10000 / 11000
Done. Failed extractions: 2

Training XGBoost model with StratifiedKFold...
[Results and metrics displayed]

Model saved to url_model.pkl

Step 2: Start the ML Service (FastAPI)

In the ml-service directory (with venv activated), run:

uvicorn main:app --reload --port 8000

Expected output:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started server process [12345]
INFO:     Waiting for application startup.

The FastAPI service is now running. You can access:

API: http://localhost:8000
Interactive Docs: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Step 3: Start the Node.js Backend (New Terminal)

In the backend-node directory, run:

node index.js

Expected output:

Node running on 3000

API Endpoints

1. Predict URL (Primary Endpoint)

Endpoint: POST http://localhost:3000/predict-url

Request Body:

{
  "url": "https://www.example.com/login"
}

Response:

{
  "url": "https://www.example.com/login",
  "risk_score": 15.3,
  "label": "benign",
  "features": {
    "url_length": 29,
    "domain_length": 15,
    "has_https": 1,
    "num_dots": 2,
    ...
  }
}

Response Fields:

url - The analyzed URL
risk_score - Estimated likelihood that the URL is malicious, as a percentage (0-100)
label - Verdict: "benign", "suspicious", or "malicious"
features - All extracted features (can be removed for production)
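
For callers written in Python, the request and response shown above could be handled roughly as follows. This is a hedged sketch using only the standard library; the helper names (`build_predict_request`, `summarize_response`, `predict`) are hypothetical and not part of the project.

```python
import json
import urllib.request

PREDICT_ENDPOINT = "http://localhost:3000/predict-url"  # endpoint documented above


def build_predict_request(url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the predict endpoint."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        PREDICT_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(raw_json: str) -> str:
    """Turn a response body into a one-line summary of the verdict."""
    payload = json.loads(raw_json)
    return f"{payload['url']}: {payload['label']} ({payload['risk_score']}%)"


def predict(url: str, timeout: float = 5.0) -> str:
    """Send the request; both services must be running for this to succeed."""
    request = build_predict_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_response(response.read().decode("utf-8"))
```

Calling `predict("https://www.example.com/login")` with both services up should return a single summary line built from the `url`, `label`, and `risk_score` fields.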

2. Health Check

Endpoint: GET http://localhost:8000/health

Response:

{
  "status": "ok",
  "features_loaded": 30
}

Project Structure

project/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── ml-service/                        # Python ML Service (FastAPI)
│   ├── venv/                          # Virtual environment (created during setup)
│   ├── __pycache__/                   # Compiled Python files
│   ├── main.py                        # FastAPI application & endpoints
│   ├── train.py                       # Model training script
│   ├── features.py                    # URL feature extraction logic
│   └── url_model.pkl                  # Pre-trained XGBoost model
│
├── backend-node/                      # Node.js Express API Gateway
│   ├── node_modules/                  # NPM packages (created during setup)
│   ├── package.json                   # Node.js dependencies
│   ├── package-lock.json              # Dependency versions lock file
│   └── index.js                       # Express server & routing
│
└── datasets/                          # Training data
    ├── apk/                           # APK analysis datasets (not used)
    └── url/
        └── phishing.csv               # Phishing URL dataset (~11k URLs)

Dataset

Phishing URL Dataset

Location: datasets/url/phishing.csv

Format:

Column   Type      Description
URL      String    The target URL to classify
label    Integer   0 = Benign, 1 = Malicious/Phishing

Statistics:

Total URLs: ~11,000
Benign URLs: ~5,800
Malicious URLs: ~5,200
Balance: Approximately balanced dataset
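
To check the class balance of a dataset in this format, something like the following could be used. It is a sketch that assumes a comma-delimited file whose header row contains exactly `URL` and `label`; the function name `label_counts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file) -> Counter:
    """Count rows per label in a CSV with `URL` and `label` columns."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


# Tiny in-memory stand-in for datasets/url/phishing.csv (made-up rows).
sample = io.StringIO(
    "URL,label\n"
    "https://www.example.com/,0\n"
    "http://192.0.2.10/login,1\n"
    "https://example.org/docs,0\n"
)
counts = label_counts(sample)
print(f"Benign: {counts[0]}  |  Malicious: {counts[1]}")
```

To run it against the real file, pass `open("datasets/url/phishing.csv", newline="")` instead of the in-memory sample.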

Feature Engineering

The system extracts 30+ features from each URL:

Structural Features:

URL, domain, path, and query lengths
Character counts (dots, hyphens, underscores, slashes, etc.)
Number of parameters and subdomains

Protocol Features:

HTTPS presence
IP-based domains
Port information
Special characters (@ sign, double slash, hex encoding, etc.)

Content Features:

Suspicious TLDs (.tk, .ml, .ga, etc.)
URL shorteners (bit.ly, tinyurl.com, etc.)
Suspicious keywords (login, verify, bank, secure, etc.)

Ratio Features:

Digit ratio in domain
Special character ratio in URL
Domain to URL length ratio
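
A few of the feature groups above can be computed with the standard library alone. The sketch below is illustrative, not the contents of features.py: the function name, the feature names other than `url_length`, `domain_length`, `has_https`, and `num_dots`, and the short keyword and TLD lists are assumptions. `url_length` is assumed to be the character count of the full URL.

```python
import re
from urllib.parse import urlparse

# Illustrative lists only; the project's actual lists are not shown here.
SUSPICIOUS_TLDS = (".tk", ".ml", ".ga")
SHORTENERS = {"bit.ly", "tinyurl.com"}
KEYWORDS = ("login", "verify", "bank", "secure")


def extract_features(url: str) -> dict:
    """Compute a handful of URL-only features (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Structural features
        "url_length": len(url),
        "domain_length": len(host),
        "num_dots": url.count("."),
        # Protocol features
        "has_https": int(parsed.scheme == "https"),
        "has_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "has_at": int("@" in url),
        # Content features
        "suspicious_tld": int(host.endswith(SUSPICIOUS_TLDS)),
        "is_shortener": int(host in SHORTENERS),
        "num_keywords": sum(word in url.lower() for word in KEYWORDS),
        # Ratio features
        "digit_ratio_domain": (
            sum(c.isdigit() for c in host) / len(host) if host else 0.0
        ),
    }


features = extract_features("https://www.example.com/login")
print(features)
```

Every value is derived from the URL string itself, which is what lets the extraction run without network requests.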

Model Training

The train.py script handles the entire training pipeline:

What it does:

Load Data - Reads phishing.csv and cleans/normalizes
Feature Extraction - Extracts 30+ features per URL
Model Training - Trains XGBoost classifier with StratifiedKFold cross-validation
Evaluation - Generates classification reports and confusion matrices
Save Model - Persists model to url_model.pkl for production use
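
To illustrate what stratified cross-validation means for the training step, the sketch below assigns samples to folds so that each fold keeps roughly the same benign/malicious ratio. It is a conceptual, pure-Python illustration with a hypothetical function name; it is not scikit-learn's `StratifiedKFold` implementation and not the code in train.py.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold, keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # toy data: 6 benign, 4 malicious
for fold in stratified_folds(labels, n_folds=2):
    held_out = [labels[i] for i in fold]
    print(f"validation fold: {held_out.count(0)} benign, {held_out.count(1)} malicious")
```

Each fold serves once as the validation set while the remaining folds are used for training, so every URL is evaluated exactly once.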

To retrain the model:

cd ml-service
python train.py

Expected metrics:

Accuracy: 95-97%
Precision: 94-96%
Recall: 95-98%
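
For reference, the three metrics follow directly from confusion-matrix counts, with "malicious" treated as the positive class. The counts in this sketch are toy values chosen for easy arithmetic, not results from this project, and the function name is hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Share of all URLs classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of URLs flagged malicious that really are malicious.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of truly malicious URLs that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a phishing detector, recall matters most when missed malicious URLs are costly, while precision matters most when false alarms are.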

Troubleshooting

Issue: "ModuleNotFoundError: No module named 'fastapi'"

Solution: Make sure the virtual environment is activated and the dependencies are installed (run from the ml-service directory)

source venv/bin/activate  # Linux/macOS
pip install -r ../requirements.txt

Issue: "Connection refused" when Node.js calls ML service

Solution: Ensure FastAPI service is running on port 8000

# In ml-service directory, check if running:
lsof -i :8000  # Show processes on port 8000
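
Where `lsof` is not available (for example on Windows), a small Python check can show whether anything is accepting connections on the two ports. This is a sketch using the standard `socket` module; the function name `is_port_open` is hypothetical, and the ports are the ones documented above.

```python
import socket


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


for name, port in (("FastAPI ML service", 8000), ("Express backend", 3000)):
    state = "listening" if is_port_open(port) else "not reachable"
    print(f"{name} (port {port}): {state}")
```

A "listening" result only shows that some process holds the port; the /health endpoint confirms that it is actually the ML service.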

Issue: Port 3000 or 8000 already in use

Solution: Kill the process using the port or use a different port

# Kill process on port 8000
lsof -ti :8000 | xargs kill -9

# Or change port in index.js or main.py

Issue: "url_model.pkl not found"

Solution: Train the model first

cd ml-service
python train.py

Issue: Virtual environment not activating

Solution: Confirm that Python 3.8+ is installed, or bypass activation by calling the virtual environment's interpreter through its absolute path

/path/to/project/ml-service/venv/bin/python train.py
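
To confirm which interpreter is actually in use, a short script like the one below can help. It is a sketch with hypothetical function names; it relies on the fact that inside an environment created by `python -m venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-created environment."""
    return sys.prefix != sys.base_prefix


def python_is_supported(minimum=(3, 8)) -> bool:
    """True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
print("Python 3.8+:", python_is_supported())
```

If the interpreter path does not point inside ml-service/venv, the environment is not active in that terminal.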

Quick Start Summary

For the fastest setup, run these commands in order:

# 1. Setup Python environment
cd ml-service
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r ../requirements.txt

# 2. Train model (if needed)
python train.py

# 3. Test model (custom test cases can be added inside test_model.py)
python test_model.py

# 4. Start ML service (Terminal 1)
uvicorn main:app --reload --port 8000

# 5. Setup Node.js (Terminal 2)
cd ../backend-node
npm install
node index.js

# 6. Test API (Terminal 3)
curl -X POST http://localhost:3000/predict-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

License

ISC License

Support

For issues or questions, ensure:

✅ Both services are running (FastAPI on 8000, Express on 3000)
✅ Virtual environment is activated
✅ All dependencies are installed
✅ Model file (url_model.pkl) exists
✅ Dataset exists at datasets/url/phishing.csv

Last Updated: April 2026
