GitHub - shivgarg/spam-classification: SVM implementation to classify email as spam or not spam

Spam Classification

This project classifies email as spam or not spam using a support vector machine (SVM). The code is written in MATLAB and relies on the cvx and libSVM packages, both of which are included in the repository. There are scripts for two kernels: one linear and one Gaussian.
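As a rough illustration of the two kernels, the sketch below computes a linear and a Gaussian kernel on sparse word-count vectors stored as Python dicts. It is not taken from the repository's MATLAB scripts; the function names, the dict representation, and the `sigma` parameter are assumptions made for this example.

```python
import math

def linear_kernel(x, z):
    """Dot product of two sparse word-count vectors stored as dicts."""
    # Iterate over the smaller dict so fewer lookups are needed.
    if len(x) > len(z):
        x, z = z, x
    return sum(count * z.get(word, 0) for word, count in x.items())

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 * sigma^2)) for sparse dict vectors."""
    # Uses the identity ||x - z||^2 = <x, x> - 2<x, z> + <z, z>.
    sq_dist = linear_kernel(x, x) - 2 * linear_kernel(x, z) + linear_kernel(z, z)
    return math.exp(-sq_dist / (2 * sigma ** 2))

# Hypothetical word counts for two emails.
a = {"free": 2, "offer": 1}
b = {"free": 1, "meeting": 3}
print(linear_kernel(a, b))
print(gaussian_kernel(a, b, sigma=1.0))
```

libSVM parameterizes its RBF kernel as exp(-gamma * ||x - z||^2), so `sigma` here corresponds to gamma = 1 / (2 * sigma^2).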

Dataset

The data is a subset of the 2005 TREC Public Spam Corpus and consists of a training set and a test set. Both files use the same format: each line holds the space-delimited properties of one email. The first field is the email ID, the second indicates whether the email is spam or ham (non-spam), and the remaining fields are words and their occurrence counts in that email. The dataset provided here is a processed version of the original in which non-word characters have been removed and some basic feature selection has been applied.
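The line format described above could be parsed roughly as follows. This is a minimal sketch, not the repository's transform_data.py: it assumes the label field is the literal token `spam` or `ham` and that the remaining fields alternate word, count, word, count. The function name `parse_line` and the sample line are hypothetical.

```python
def parse_line(line):
    """Split one dataset line into (email_id, is_spam, word_counts)."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least an email ID and a label")
    email_id, label = tokens[0], tokens[1]
    rest = tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError("words and counts must come in pairs")
    # Assumption: remaining tokens alternate word, count, word, count, ...
    word_counts = {}
    for word, count in zip(rest[0::2], rest[1::2]):
        word_counts[word] = word_counts.get(word, 0) + int(count)
    # Assumption: the label token is the literal string "spam" or "ham".
    return email_id, label == "spam", word_counts

# Made-up line in the assumed format, not copied from the dataset.
example = "id1 spam free 2 offer 1"
print(parse_line(example))
```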

Usage

Run transform_data.py. It parses the dataset and produces two files, one with the features and one with the classification of each mail.
Use the script in this way: `python transform_data.py <no. of lines in train data file> <no. of lines in test data file>`
Set up cvx in MATLAB or Octave, following the instructions given in the cvx package.
Run the script to get the accuracy as output. More features can be added to the dataset by changing the Python script.
To use libSVM, set it up using the instructions in the libSVM package.
Run the MATLAB script to get the accuracy.
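The steps above split each email into a feature vector and a class label, and the final scripts report accuracy. The sketch below shows one way this could look in Python. It is an illustration only: the dense row layout, the +1/-1 label encoding, and the names `build_features` and `accuracy` are assumptions, not the behaviour of transform_data.py or the MATLAB scripts.

```python
def build_features(records, vocabulary=None):
    """Turn (is_spam, word_counts) records into dense rows and +1/-1 labels."""
    if vocabulary is None:
        # Fix a column order from every word seen in these records.
        vocabulary = sorted({w for _, counts in records for w in counts})
    rows = [[counts.get(w, 0) for w in vocabulary] for _, counts in records]
    # Assumption: spam is encoded as +1 and ham as -1.
    labels = [1 if is_spam else -1 for is_spam, _ in records]
    return rows, labels, vocabulary

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical parsed emails: (is_spam, word_counts).
train = [(True, {"free": 2, "offer": 1}), (False, {"meeting": 1})]
rows, labels, vocab = build_features(train)
print(vocab)
print(rows)
print(labels)
print(accuracy([1, 1], labels))
```

Passing the training vocabulary back in as `vocabulary` when transforming the test set keeps the feature columns aligned between the two files.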

A report on the performance of the SVM on this dataset, together with the results, is included in the Analysis.docx file. No major feature engineering was done to obtain these results, and much more can be done to improve the performance of the system.

Folders and files

Support vectors ID
code
cvx
libsvm-3.20
spam_data
Analysis.docx
README.md
