
SPAM filter based on Naive Bayes model

SPAM classifier for textual data


Spam Filter

Description

A spam filter that classifies email messages using a Naive Bayes approach with add-one smoothing. Features used: bag of words with associated probabilities.
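To make the approach concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing. It is not the code in nblearn.py or nbclassify.py: the function names `train` and `classify`, the whitespace tokenization, the in-memory model layout, and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per label. labeled_docs is an iterable of (label, text)."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()  # assumption: plain whitespace tokens
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log posterior probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, counts in model["word_counts"].items():
        # Log prior: fraction of training documents with this label.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            if word not in model["vocab"]:
                continue  # ignore words never seen during training
            # Add-one smoothing: every vocabulary word gets one extra count,
            # so an unseen (label, word) pair never yields probability zero.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for this example.
model = train([
    ("SPAM", "win cash now"),
    ("SPAM", "cheap cash offer"),
    ("HAM", "meeting at noon"),
    ("HAM", "lunch at noon tomorrow"),
])
print(classify(model, "cash offer now"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.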

Running instructions

The filter can be used ad-hoc with the model provided (spam.nb) or it can be retrained with new data.

1. Ad-hoc classification

Run classification with the following command:

python3 nbclassify.py modelfile testfile

The modelfile in this case is spam.nb. The testfile must be formatted so that each line holds a single piece of text to be classified.
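Since each message must occupy exactly one line, multi-line emails need to be flattened before classification. The following is a small sketch of one way to do that; the helper names `flatten_message` and `write_testfile`, the sample messages, and the output file name are hypothetical and not part of this project.

```python
import os
import tempfile

def flatten_message(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())

def write_testfile(messages, path):
    """Write one flattened message per line."""
    with open(path, "w", encoding="utf-8") as out:
        for message in messages:
            out.write(flatten_message(message) + "\n")

# Sample messages, invented for this example.
messages = ["Win cash now!\nClick here.", "Lunch at noon?\r\nSee you there."]
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
write_testfile(messages, path)
```

The resulting file can then be passed as the testfile argument.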

2. Retraining and classification

Retraining is done in two phases: data formatting and model generation.

Data formatting

In this stage, the individual text files to be used for training are aggregated into a single training file. Follow the naming convention for the files: SPAM# for files whose contents are spam and HAM# for all other files. Put all such files into a separate folder along with parsetraining.py and run the following command:

python3 parsetraining.py spam
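For readers curious what such an aggregation step might look like, here is a hedged sketch. It does not reproduce parsetraining.py: the `LABEL text` line format, the assumption that # stands for a number, and the names `label_for` and `aggregate` are all guesses made for illustration, and the actual training file format may differ.

```python
import os
import re
import tempfile

# Assumption: file names start with SPAM or HAM followed by digits.
NAME_PATTERN = re.compile(r"^(SPAM|HAM)\d+")

def label_for(filename):
    """Return 'SPAM' or 'HAM' from the file name, or None if it does not match."""
    match = NAME_PATTERN.match(filename)
    return match.group(1) if match else None

def aggregate(folder, out_path):
    """Write one 'LABEL text' line per matching file; return the file count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for filename in sorted(os.listdir(folder)):
            full_path = os.path.join(folder, filename)
            label = label_for(filename)
            if label is None or not os.path.isfile(full_path):
                continue  # skip scripts, output files, and subfolders
            with open(full_path, encoding="utf-8", errors="ignore") as f:
                text = " ".join(f.read().split())  # flatten to a single line
            out.write(f"{label} {text}\n")
            count += 1
    return count

# Demo on a throwaway folder with invented file contents.
folder = tempfile.mkdtemp()
samples = [("SPAM1.txt", "win cash\nnow"), ("HAM1.txt", "lunch at noon"),
           ("notes.txt", "ignore me")]
for name, body in samples:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write(body)
out_path = os.path.join(folder, "training_sketch.txt")
n_files = aggregate(folder, out_path)
```

Matching on the prefix plus digits, rather than the prefix alone, keeps a file such as spam_training.txt from being mistaken for a training example.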

Model generation

The data formatting phase generates a training file, spam_training.txt, which is used to train the model. Training is done with the following command:

python3 nblearn.py trainingfile modelfile

The model generation phase can also be followed by an evaluation step that measures the model's performance on a held-out development set. The command is as follows:

python3 nbclassify.py modelfile devfile eval

This generates a text file, evaluation.txt, that contains the evaluation report.
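The contents of the evaluation report are not described here. As a general illustration of how a spam classifier is commonly scored on a development set, the sketch below computes precision, recall, and F1 for the spam class. The function name `evaluate` and the example labels are hypothetical, and the metrics in evaluation.txt may differ.

```python
def evaluate(gold, predicted, positive="SPAM"):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Labels invented for this example.
gold = ["SPAM", "SPAM", "HAM", "HAM"]
predicted = ["SPAM", "HAM", "HAM", "SPAM"]
precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

Precision penalizes legitimate mail flagged as spam, while recall penalizes spam that slips through, so reporting both gives a fuller picture than accuracy alone.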