
namidanam/Android-Malware-Detection

 
 


Android-Malware-Detection

Phishing URL Detector

(To run the project without the Node.js server, see the README inside the project folder.)

A machine learning-powered phishing detection system that identifies malicious URLs using URL-based feature extraction and an XGBoost classifier. Built with a Python FastAPI ML service and a Node.js Express backend.



Overview

This project detects whether a given URL is benign, suspicious, or malicious using machine learning classification. The system:

  • Extracts 30+ features from URLs without making network requests
  • Uses XGBoost classifier for high-accuracy predictions
  • Provides REST API endpoints for easy integration
  • Returns confidence scores and detailed predictions

Risk Score Categories:

  • 0-30%: Benign (safe)
  • 30-60%: Suspicious (caution advised)
  • 60-100%: Malicious (likely phishing)
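The thresholds above can be written as a small mapping function. This is a minimal sketch, not the project's code: the name `label_for_score` is hypothetical, and because the listed ranges share their endpoints, it assumes that a boundary value belongs to the higher-risk band (30 is "suspicious", 60 is "malicious").

```python
def label_for_score(risk_score: float) -> str:
    """Map a 0-100 risk score to a verdict label.

    Assumption: boundary values fall into the higher-risk band
    (30 -> suspicious, 60 -> malicious); the README does not specify this.
    """
    if not 0 <= risk_score <= 100:
        raise ValueError(f"risk_score must be in [0, 100], got {risk_score}")
    if risk_score < 30:
        return "benign"
    if risk_score < 60:
        return "suspicious"
    return "malicious"


if __name__ == "__main__":
    for score in (15.3, 30, 59.9, 87.0):
        print(score, label_for_score(score))
```

If the service treats the boundaries differently, only the two comparison operators need to change.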

Features

  • URL-based feature extraction - No external API calls or network requests
  • Fast inference - Pre-loaded model, response in milliseconds
  • Detailed predictions - Risk scores, labels, and extracted features
  • Production-ready - FastAPI with health checks and error handling
  • Scalable architecture - Separated ML service and API backend
  • Easy deployment - Docker-compatible, minimal dependencies


Architecture

┌─────────────────────────────────────────────────────────┐
│          Node.js Express Backend (Port 3000)            │
│                  (API Gateway)                          │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ HTTP Request
                     │ POST /predict-url
                     ▼
┌─────────────────────────────────────────────────────────┐
│      Python FastAPI ML Service (Port 8000)              │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Feature Extraction (30+ URL features)             │  │
│  └───────────────────────────────────────────────────┘  │ 
│  ┌───────────────────────────────────────────────────┐  │
│  │ XGBoost Model (Pre-loaded & Cached)               │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Risk Assessment & Response Generation             │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Prerequisites

Before starting, ensure you have installed Python (3.8+), Node.js, and npm.

To verify installations:

python --version
node --version
npm --version

Installation & Setup

Step 1: Clone/Navigate to Project Directory

cd /path/to/project

Step 2: Set Up Python Virtual Environment

Navigate to the ml-service directory and create a virtual environment:

cd ml-service
python -m venv venv

Activate the virtual environment:

On Linux/macOS:

source venv/bin/activate

On Windows:

venv\Scripts\activate

Step 3: Install Python Dependencies

pip install -r ../requirements.txt

Key Python packages installed:

  • fastapi - Web framework for ML service
  • uvicorn - ASGI server
  • xgboost - ML classification model
  • pandas - Data processing
  • numpy - Numerical computing
  • scikit-learn - Machine learning utilities
  • joblib - Model serialization

Step 4: Install Node.js Dependencies

In a new terminal, navigate to the backend-node directory:

cd backend-node
npm install

This installs:

  • express - Web framework
  • axios - HTTP client for ML service communication
  • cors - Cross-origin resource sharing middleware

Running the Project

Step 1: Train the Model (Optional - One-time Setup)

If you don't have a pre-trained model (url_model.pkl), train it first:

cd ml-service
python train.py

This will:

  • Load the dataset from datasets/url/phishing.csv
  • Extract URL features
  • Train the XGBoost classifier
  • Save the model to url_model.pkl
  • Display training metrics and cross-validation scores

Expected output:

Dataset: 11000 URLs  |  Malicious: 5200  |  Benign: 5800
Extracting features (URL-only, no network calls)...
  0 / 11000
  10000 / 11000
Done. Failed extractions: 2

Training XGBoost model with StratifiedKFold...
[Results and metrics displayed]

Model saved to url_model.pkl

Step 2: Start the ML Service (FastAPI)

In the ml-service directory (with venv activated), run:

uvicorn main:app --reload --port 8000

Expected output:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started server process [12345]
INFO:     Waiting for application startup.

The FastAPI service is now running. You can access:

  • API: http://localhost:8000
  • Interactive Docs: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Step 3: Start the Node.js Backend (New Terminal)

In the backend-node directory, run:

node index.js

Expected output:

Node running on 3000

API Endpoints

1. Predict URL (Primary Endpoint)

Endpoint: POST http://localhost:3000/predict-url

Request Body:

{
  "url": "https://www.example.com/login"
}

Response:

{
  "url": "https://www.example.com/login",
  "risk_score": 15.3,
  "label": "benign",
  "features": {
    "url_length": 29,
    "domain_length": 15,
    "has_https": 1,
    "num_dots": 2,
    ...
  }
}

Response Fields:

  • url - The analyzed URL
  • risk_score - Predicted phishing risk as a percentage (0-100); see Risk Score Categories
  • label - Verdict: "benign", "suspicious", or "malicious"
  • features - All extracted features (can be removed for production)
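A client for this endpoint might look like the following standard-library sketch. The endpoint URL and the `{"url": ...}` body come from this README; the helper names `build_predict_request` and `predict` and the 5-second timeout are assumptions made for illustration.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3000/predict-url"  # endpoint from this README


def build_predict_request(url: str, endpoint: str = GATEWAY_URL) -> urllib.request.Request:
    """Build (but do not send) the JSON POST request for one URL."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url: str, endpoint: str = GATEWAY_URL, timeout: float = 5.0) -> dict:
    """Send the request and return the decoded JSON response.

    Requires both services to be running; raises urllib.error.URLError otherwise.
    """
    request = build_predict_request(url, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With both services running, `predict("https://www.example.com/login")` should return a dictionary with the fields listed above, so `result["label"]` and `result["risk_score"]` can be read directly.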

2. Health Check

Endpoint: GET http://localhost:8000/health

Response:

{
  "status": "ok",
  "features_loaded": 30
}

Project Structure

project/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── ml-service/                        # Python ML Service (FastAPI)
│   ├── venv/                          # Virtual environment (created during setup)
│   ├── __pycache__/                   # Compiled Python files
│   ├── main.py                        # FastAPI application & endpoints
│   ├── train.py                       # Model training script
│   ├── features.py                    # URL feature extraction logic
│   └── url_model.pkl                  # Pre-trained XGBoost model
│
├── backend-node/                      # Node.js Express API Gateway
│   ├── node_modules/                  # NPM packages (created during setup)
│   ├── package.json                   # Node.js dependencies
│   ├── package-lock.json              # Dependency versions lock file
│   └── index.js                       # Express server & routing
│
└── datasets/                          # Training data
    ├── apk/                           # APK analysis datasets (not used)
    └── url/
        └── phishing.csv               # Phishing URL dataset (~11k URLs)

Dataset

Phishing URL Dataset

Location: datasets/url/phishing.csv

Format:

Column   Type      Description
URL      String    The target URL to classify
label    Integer   0 = Benign, 1 = Malicious/Phishing

Statistics:

  • Total URLs: ~11,000
  • Benign URLs: ~5,800
  • Malicious URLs: ~5,200
  • Balance: Approximately balanced dataset
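To check the class balance of a CSV in this format, a short standard-library sketch is enough. It assumes the header row uses exactly the column names `URL` and `label` from the table above; the function name `class_balance` is hypothetical.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file) -> Counter:
    """Count rows per label in a CSV that has 'URL' and 'label' columns."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


if __name__ == "__main__":
    # Tiny in-memory sample; replace with open("datasets/url/phishing.csv", newline="")
    sample = io.StringIO(
        "URL,label\n"
        "https://example.com,0\n"
        "http://bad.example/login,1\n"
        "https://example.org/docs,0\n"
    )
    counts = class_balance(sample)
    print(f"Benign: {counts[0]}  |  Malicious: {counts[1]}")
```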

Feature Engineering

The system extracts 30+ features from each URL:

Structural Features:

  • URL, domain, path, and query lengths
  • Character counts (dots, hyphens, underscores, slashes, etc.)
  • Number of parameters and subdomains

Protocol Features:

  • HTTPS presence
  • IP-based domains
  • Port information
  • Special characters (@ sign, double slash, hex encoding, etc.)

Content Features:

  • Suspicious TLDs (.tk, .ml, .ga, etc.)
  • URL shorteners (bit.ly, tinyurl.com, etc.)
  • Suspicious keywords (login, verify, bank, secure, etc.)

Ratio Features:

  • Digit ratio in domain
  • Special character ratio in URL
  • Domain to URL length ratio
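A few of these features can be sketched with the standard library alone. This is an illustrative subset, not the contents of features.py: only `url_length`, `domain_length`, `has_https`, and `num_dots` are names that appear in the example response, and the remaining keys, the keyword and TLD lists, and the IPv4 pattern are assumptions.

```python
import re
from urllib.parse import urlparse

# Illustrative lists only; the project's actual lists may differ.
SUSPICIOUS_TLDS = (".tk", ".ml", ".ga")
SHORTENERS = ("bit.ly", "tinyurl.com")
KEYWORDS = ("login", "verify", "bank", "secure")
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_features(url: str) -> dict:
    """Compute lexical features from the URL string alone (no network calls)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    special = sum(1 for ch in url if not ch.isalnum())
    return {
        # Structural features
        "url_length": len(url),
        "domain_length": len(host),
        "path_length": len(parsed.path),
        "query_length": len(parsed.query),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        # Protocol features
        "has_https": int(parsed.scheme == "https"),
        "has_at_sign": int("@" in url),
        "has_ip_domain": int(bool(IPV4_RE.match(host))),
        # Content features
        "suspicious_tld": int(host.endswith(SUSPICIOUS_TLDS)),
        "is_shortener": int(host in SHORTENERS),
        "num_keywords": sum(kw in url.lower() for kw in KEYWORDS),
        # Ratio features
        "domain_digit_ratio": sum(ch.isdigit() for ch in host) / len(host) if host else 0.0,
        "special_char_ratio": special / len(url) if url else 0.0,
        "domain_url_ratio": len(host) / len(url) if url else 0.0,
    }


if __name__ == "__main__":
    print(extract_features("https://www.example.com/login"))
```

Because every value is derived from the string itself, extraction stays fast and works offline, which matches the "no network requests" design described above.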

Model Training

The train.py script handles the entire training pipeline:

What it does:

  1. Load Data - Reads phishing.csv and cleans/normalizes
  2. Feature Extraction - Extracts 30+ features per URL
  3. Model Training - Trains XGBoost classifier with StratifiedKFold cross-validation
  4. Evaluation - Generates classification reports and confusion matrices
  5. Save Model - Persists model to url_model.pkl for production use
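To illustrate what the stratified cross-validation in step 3 guarantees, here is a simplified pure-Python stand-in for scikit-learn's StratifiedKFold. It is not the code in train.py: the function name `stratified_folds` is hypothetical, and unlike the library class it does no shuffling. The point is only that each test fold keeps roughly the same benign/malicious ratio as the full dataset.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Yield (train_idx, test_idx) pairs whose test folds preserve class ratios.

    Indices of each class are dealt round-robin into n_splits folds, so every
    fold receives a near-equal share of every class.
    """
    folds = [[] for _ in range(n_splits)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


if __name__ == "__main__":
    toy_labels = [0] * 6 + [1] * 4  # 60% benign, 40% malicious
    for train_idx, test_idx in stratified_folds(toy_labels, n_splits=2):
        print("test fold:", [toy_labels[i] for i in test_idx])
```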

To retrain the model:

cd ml-service
python train.py

Expected metrics:

  • Accuracy: 95-97%
  • Precision: 94-96%
  • Recall: 95-98%
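For reference, these three metrics follow directly from the confusion matrix counts, with "malicious" treated as the positive class. The sketch below uses toy counts chosen only to show the arithmetic; they are not results from this project, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion matrix counts.

    tp/fn: malicious URLs flagged / missed; fp/tn: benign URLs flagged / passed.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Toy counts for illustration only
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

For a phishing detector, recall measures how many malicious URLs are caught, while precision measures how many flagged URLs are actually malicious.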

Troubleshooting

Issue: "ModuleNotFoundError: No module named 'fastapi'"

Solution: Ensure the virtual environment is activated and the dependencies are installed (run from the ml-service directory)

source venv/bin/activate  # Linux/macOS
pip install -r ../requirements.txt

Issue: "Connection refused" when Node.js calls ML service

Solution: Ensure FastAPI service is running on port 8000

# In ml-service directory, check if running:
lsof -i :8000  # Show processes on port 8000
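Where lsof is unavailable (for example on Windows), the same check can be done from Python. This is a minimal sketch; the helper name `is_port_open` and the 1-second timeout are assumptions, and it only tells you that something is listening on the port, not that it is the FastAPI service.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("ML service port open:", is_port_open(port=8000))
    print("Node backend port open:", is_port_open(port=3000))
```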

Issue: Port 3000 or 8000 already in use

Solution: Kill the process using the port or use a different port

# Kill process on port 8000
lsof -ti :8000 | xargs kill -9

# Or change port in index.js or main.py

Issue: "url_model.pkl not found"

Solution: Train the model first

cd ml-service
python train.py

Issue: Virtual environment not activating

Solution: Call the virtual environment's Python interpreter by its absolute path, or confirm that Python 3.8+ is installed

/path/to/project/ml-service/venv/bin/python train.py

Quick Start Summary

For fastest setup, run these commands in order:

# 1. Setup Python environment
cd ml-service
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r ../requirements.txt

# 2. Train model (if needed)
python train.py

# 3. Test model
python test_model.py
# Custom test cases can be added inside this file

# 4. Start ML service (Terminal 1)
uvicorn main:app --reload --port 8000

# 5. Setup Node.js (Terminal 2)
cd ../backend-node
npm install
node index.js

# 6. Test API (Terminal 3)
curl -X POST http://localhost:3000/predict-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

License

ISC License


Support

For issues or questions, ensure:

  • ✅ Both services are running (FastAPI on 8000, Express on 3000)
  • ✅ Virtual environment is activated
  • ✅ All dependencies are installed
  • ✅ Model file (url_model.pkl) exists
  • ✅ Dataset exists at datasets/url/phishing.csv

Last Updated: April 2026
