This repository contains the implementation of Project 4 for the INFO411 Data Mining and Knowledge Discovery course. The project focuses on web spam detection using the UK2007 benchmark dataset, implementing and comparing classification methods across multiple feature sets.
Task: Document Classification (Web Spam Detection)
Web spam refers to activities intended to mislead search engines into believing that a particular web page has high authority value for specific queries, while the page may contain little or no relevant information. Search engines rank URLs based on:
- Content relevance - how well page content matches the query
- Page popularity - typically measured by link-based metrics
Spam techniques are classified into two main categories:
- Link spam:
  - Web pages are linked from many other sites to artificially increase their popularity
  - Often involves "link farms" where links are generated automatically
  - Exploits popularity-based ranking factors
- Content spam:
  - Web pages contain terms that are visually hidden from users
  - The terms are irrelevant to the actual content but indexable by search engines
  - Increases the probability of appearing in search results
The UK2007 dataset is a large collection of annotated spam/nonspam hosts:
- Size: 105,896,555 pages across 114,529 hosts in the .UK domain
- Host IDs: Numbered from 0 to 114,528 (same ordering as in uk-2007-05.hostnames.txt.gz)
- Labeling: Tagged at host level by volunteers
- Labels: Available from WEBSPAM-UK2007
- Features: Available from UK2007 Features
This project uses re-computed feature sets provided for the Web Spam Challenge 2008. All feature sets are available in CSV, Matlab, and ARFF (Weka) formats.
Feature Set 1 (Direct Features):
- File: 1.uk-2007-05.obvious_features.csv (renamed from downloaded file)
- Download: uk-2007-05.obvious_features.csv.gz (1.3 MB)
- Description: Computed from graph files; includes two direct, obvious features
- Content:
  - Number of pages in the host
  - Number of characters in the host name
Raw Link-based Features:
- File: Available but not used in this project
- Download: uk-2007-05.link_based_features.csv.gz (19 MB)
- Description: Raw link-based features computed from graph files
- Content: In-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, supporter estimates, etc.
- Additional File: uk-2007-05.homepageuid_maxpruid.csv.gz (home page and max PageRank page URL-IDs)
Feature Set 2b (Transformed Link-based Features):
- File: 2.uk-2007-05.link_based_features_transformed.csv (renamed from downloaded file)
- Download: uk-2007-05.link_based_features_transformed.csv.gz (68 MB)
- Description: Numeric transformations of the link-based features, found to work better for classification than the raw values
- Content:
  - Feature ratios (e.g., Indegree/PageRank, TrustRank/PageRank)
  - Logarithmic transformations of several features
  - Optimized transformations for better classification performance
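The ratio and logarithmic transformations described above can be illustrated on a toy table. This is only a sketch: the column names (`indegree`, `pagerank`, `trustrank`) and all values are hypothetical and do not reflect the actual schema of the transformed CSV file.

```r
# Toy host-level link features; column names and values are hypothetical.
link_raw <- data.frame(
  hostid    = c(0, 1, 2),
  indegree  = c(10, 2500, 0),
  pagerank  = c(1e-6, 4e-5, 2e-7),
  trustrank = c(5e-7, 1e-6, 0)
)

# Ratio features relate one link measure to another.
# log1p(x) computes log(1 + x), which stays finite when x is 0.
link_transformed <- transform(
  link_raw,
  indegree_per_pagerank  = indegree / pagerank,
  trustrank_per_pagerank = trustrank / pagerank,
  log_indegree           = log1p(indegree),
  log_pagerank           = log(pagerank)
)

print(link_transformed)
```

Log transforms compress heavy-tailed quantities such as in-degree, which often makes them easier for a classifier to use.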
Feature Set 3a (Content-based Features):
- File: 3.uk-2007-05.content_based_features.csv (renamed from downloaded file)
- Download: uk-2007-05.content_based_features.csv.gz (47 MB)
- Description: Features computed from page content summaries
- Content:
  - Number of words in home page
  - Average word length
  - Average title length
  - Additional content-based statistics for sample pages on each host
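To give a feel for what statistics such as "number of words" and "average word length" measure, here is a minimal sketch on a made-up page text. It is an illustration only and does not reproduce how the dataset authors computed the released features; the text and the whitespace tokenisation are assumptions.

```r
# Made-up page text; splitting on whitespace is a simplifying assumption.
page_text <- "Cheap flights cheap hotels book now"

words <- strsplit(page_text, "\\s+")[[1]]

num_words       <- length(words)
avg_word_length <- mean(nchar(words))

cat("Number of words:", num_words, "\n")
cat("Average word length:", avg_word_length, "\n")
```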
The spam/nonspam labels are provided in two sets:
- SET1: WEBSPAM-UK2007-SET1-labels.txt (training set, ~2/3 of data)
- SET2: WEBSPAM-UK2007-SET2-labels.txt (test set, ~1/3 of data)
Label Statistics:
- SET1: 3,776 nonspam, 222 spam, 277 undecided
- SET2: 1,933 nonspam, 122 spam, 149 undecided
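The counts above imply a strong class imbalance. The sketch below works out the spam share in each set, assuming (as an illustration, not as a statement of what the analysis does) that undecided hosts are excluded.

```r
# Label counts as listed above.
labels <- data.frame(
  set       = c("SET1", "SET2"),
  nonspam   = c(3776, 1933),
  spam      = c(222, 122),
  undecided = c(277, 149)
)

# Share of spam among hosts with a decided label (spam or nonspam).
# Excluding "undecided" hosts is an assumption made for this example.
labels$decided    <- labels$nonspam + labels$spam
labels$spam_share <- labels$spam / labels$decided

print(transform(labels, spam_share = round(spam_share, 4)))
```

Spam makes up roughly 5.6% of decided hosts in SET1 and 5.9% in SET2, so a classifier that always predicts nonspam would already reach about 94% accuracy. This is one reason to prefer a ranking metric such as AUC over accuracy.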
For users who want to download and prepare the data themselves:
- Download Feature Sets:

      # Download and extract feature set 1 (Direct Features)
      wget https://chato.cl/webspam/datasets/uk2007/features/uk-2007-05.obvious_features.csv.gz
      gunzip uk-2007-05.obvious_features.csv.gz
      mv uk-2007-05.obvious_features.csv 1.uk-2007-05.obvious_features.csv

      # Download and extract feature set 2b (Transformed Link-based Features)
      wget https://chato.cl/webspam/datasets/uk2007/features/uk-2007-05.link_based_features_transformed.csv.gz
      gunzip uk-2007-05.link_based_features_transformed.csv.gz
      mv uk-2007-05.link_based_features_transformed.csv 2.uk-2007-05.link_based_features_transformed.csv

      # Download and extract feature set 3a (Content-based Features)
      wget https://chato.cl/webspam/datasets/uk2007/features/uk-2007-05.content_based_features.csv.gz
      gunzip uk-2007-05.content_based_features.csv.gz
      mv uk-2007-05.content_based_features.csv 3.uk-2007-05.content_based_features.csv

- Download Label Files:

      # Download label files from the main dataset page
      wget https://chato.cl/webspam/datasets/uk2007/WEBSPAM-UK2007-SET1-labels.txt
      wget https://chato.cl/webspam/datasets/uk2007/WEBSPAM-UK2007-SET2-labels.txt

- Optional - Download Additional Files:

      # Host names file (if needed for reference)
      wget https://chato.cl/webspam/datasets/uk2007/uk-2007-05.hostnames.txt.gz

      # Raw link-based features (if you want to compare with transformed features)
      wget https://chato.cl/webspam/datasets/uk2007/features/uk-2007-05.link_based_features.csv.gz

      # Home page and max PageRank page mapping
      wget https://chato.cl/webspam/datasets/uk2007/features/uk-2007-05.homepageuid_maxpruid.csv.gz
- Feature Set Evaluation: Identify which feature set provides the best predictive power
- Method Selection: Deploy the most suitable classification methods for each feature set
- Performance Ranking: Rank feature sets by quality using AUC (Area Under ROC Curve)
- Comprehensive Analysis: Compare results and explain findings
- Combination Analysis: Evaluate performance of combined feature sets
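Combining feature sets amounts to joining the per-host tables on the host ID. The following is a minimal sketch with toy data; the table names, column names, and values are hypothetical, and the real files may use a different name for the host ID column.

```r
# Toy feature tables keyed by host ID; names and values are hypothetical.
direct  <- data.frame(hostid = 0:3, num_pages = c(12, 340, 7, 58))
link    <- data.frame(hostid = 0:3, log_indegree = c(2.4, 7.8, 0.0, 3.1))
content <- data.frame(hostid = c(0, 1, 3), avg_word_len = c(4.9, 6.2, 5.1))

# Inner-join all tables on hostid; hosts missing from any table are dropped.
combined <- Reduce(
  function(a, b) merge(a, b, by = "hostid"),
  list(direct, link, content)
)

print(combined)
```

An inner join keeps only hosts present in every table (host 2 is dropped here because it has no content features). Passing `all = TRUE` to `merge()` would instead keep every host and fill the gaps with `NA`.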
- Present general properties and characteristics of the UK2007 dataset
- Provide comprehensive exploratory data analysis
- Deploy appropriate classification methods to each feature set
- Justify method selection for each feature set
- Present and analyze results
- Discuss strengths and weaknesses in context of web spam detection
- Rank feature sets by predictive performance
- Use AUC as primary comparison metric
- Expected ranking based on domain knowledge:
  - Content-based Features (best)
  - Link-based Features
  - Direct Features (poorest)
- Comprehensive analysis using AUC comparisons
- Detailed explanation of findings
- Discussion of classification method performance
- Deploy classification methods on combined feature sets
- Present results with qualitative comparisons
- Analyze performance improvements/degradations
- Summarize new and interesting discoveries
- Discuss implications for web spam detection
- Provide recommendations for future work
- Source Code: output.rmd (RMarkdown file)
- Generated Report: output.html (final analysis report)
Install the required R packages:

    # Install required packages if not already installed
    # (rmarkdown is included because it is needed to render the report)
    packages <- c("tidyverse", "skimr", "corrplot", "scales",
                  "kableExtra", "caret", "pROC", "randomForest", "e1071",
                  "rmarkdown")
    missing_packages <- setdiff(packages, rownames(installed.packages()))
    if (length(missing_packages) > 0) install.packages(missing_packages)

In R/RStudio Console:

    rmarkdown::render("output.rmd")

In Terminal:

    Rscript -e "rmarkdown::render('output.rmd')"

Project structure:

    ├── README.md                                          # This file
    ├── output.rmd                                         # Main analysis code
    ├── output.html                                        # Generated report
    ├── 1.uk-2007-05.obvious_features.csv                  # Direct features
    ├── 2.uk-2007-05.link_based_features_transformed.csv   # Link-based features
    ├── 3.uk-2007-05.content_based_features.csv            # Content-based features
    ├── WEBSPAM-UK2007-SET1-labels.txt                     # Training labels
    └── WEBSPAM-UK2007-SET2-labels.txt                     # Test labels
The analysis follows a systematic approach:
- Data Exploration: Comprehensive EDA of each feature set
- Preprocessing: Data cleaning, handling missing values, feature scaling
- Model Selection: Justify and implement appropriate classification algorithms
- Evaluation: Use cross-validation and AUC metrics for robust comparison
- Feature Combination: Systematic evaluation of combined feature sets
- Results Interpretation: Domain-specific analysis of findings
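The preprocessing step above can be sketched as follows. This is a minimal illustration on toy data, not the code in output.rmd: the column names, the values, the choice to drop undecided hosts, and the use of median imputation are all assumptions.

```r
# Toy stand-ins for one feature table and one label table.
# Column names and values are hypothetical, not the real file schema.
features <- data.frame(
  hostid    = 0:5,
  num_pages = c(12, 340, NA, 7, 1500, 58),
  name_len  = c(14, 22, 9, 31, 18, 12)
)
labels <- data.frame(
  hostid = c(0, 1, 2, 4, 5),
  label  = c("nonspam", "spam", "nonspam", "undecided", "nonspam")
)

# Keep hosts that have a decided label and attach their features.
decided <- labels[labels$label %in% c("spam", "nonspam"), ]
dat <- merge(decided, features, by = "hostid")
dat$label <- factor(dat$label, levels = c("nonspam", "spam"))

# Median imputation followed by centring and scaling, column by column.
feature_cols <- setdiff(names(dat), c("hostid", "label"))
for (col in feature_cols) {
  x <- dat[[col]]
  x[is.na(x)] <- median(x, na.rm = TRUE)
  dat[[col]] <- as.numeric(scale(x))
}

print(dat)
```

In a train/test setting, the medians, means, and standard deviations would normally be estimated on the training set only and then applied to the test set, so that no information leaks from SET2 into the model.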
- Primary Metric: AUC (Area Under ROC Curve)
- Additional Metrics: Accuracy, Precision, Recall, F1-Score
- Validation: Cross-validation and train/test split evaluation
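Since AUC is the primary metric, it helps to see what it measures. The sketch below computes AUC in base R from the Mann-Whitney rank statistic; the function name `auc_rank` and the example scores are made up for illustration and are not taken from the project code.

```r
# AUC from the Mann-Whitney rank statistic: the probability that a randomly
# chosen positive (spam) host scores higher than a randomly chosen negative one.
auc_rank <- function(scores, is_positive) {
  stopifnot(length(scores) == length(is_positive))
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up classifier scores for six hosts; TRUE marks spam.
scores  <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_spam <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

print(auc_rank(scores, is_spam))  # 0.75
```

Here 6 of the 8 spam/nonspam pairs are ordered correctly, giving 0.75. The pROC package listed in the requirements provides `roc()` and `auc()` for the same purpose, along with ROC curve plotting.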
Based on domain knowledge and previous research:
- Content-based features should perform best (direct relevance to spam content)
- Link-based features should show moderate performance (link farm detection)
- Direct features should have limited predictive power (too simple)
- Combined features should outperform individual feature sets
- Manning, C., Raghavan, P., & Schütze, H. (2008). An introduction to information retrieval. Cambridge University Press.
- Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30, 107–117.
- Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. Adversarial Information Retrieval on the Web.
- Course: INFO411 Data Mining and Knowledge Discovery
- Project: Project 4 - Document Classification (Web Spam Detection)
- Institution: University of Wollongong
- Dataset: UK2007 Web Spam Benchmark
For questions or issues with the analysis, please refer to the detailed implementation in output.rmd or the generated report output.html.