## Assignment 2: Naive Bayes

### Due date: Wednesday February 24th, End of the Day

Updated 2021 By: Ayush Sarkar and Kiran Ramnath

Updated 2020 By: Jaewook Yeom

Updated 2019 By: Kedan Li and Weilin Zhang

Updated 2018 By: Justin Lizama and Medhini Narasimhan

Spam emails are a common issue we all encounter! In this assignment, you will use the Naive Bayes algorithm to train a Spam classifier with a dataset of emails that is provided to you. The task in Part 1 is to learn a bag of words (unigram) model that will classify an email as spam or ham (not spam) based on the words it contains. In Part 2, you will combine the unigram and bigram models to achieve better performance on email classification.

## General guidelines

Basic instructions are similar to MP 1:

## Dataset

The dataset that we have provided to you consists of 1500 ham and 1500 spam emails, a subset of the Enron-Spam dataset provided by Ion Androutsopoulos. This dataset is split into 2000 training examples and 1000 development examples. This dataset is located in the "spam-data" folder of the template code provided. You will also find another dataset located under the "bigram-check" folder of the template code - this data is only used by our grade.py file to check whether your model implementation is correct. The data will be provided to you along with the code, but if you run into any problems and would like to redownload the dataset individually, you can do so here: zip tar.

## Background

The bag of words model in NLP is a simple unigram model which considers a text to be represented as a bag of independent words. That is, we ignore the position the words appear in, and only pay attention to their frequency in the text. Here, each email consists of a group of words. Using Bayes theorem, you need to compute the probability of an email being ham given the words in the email. Thus you need to estimate the posterior probabilities: $P( \mathrm{Type} = \mathrm{Ham} | \mathrm{Words}) = \frac{P(\mathrm{Type}=\mathrm{Ham})}{P(\mathrm{Words})} \prod_{\mathrm{All}~\mathrm{words}} P(\mathrm{Word}|\mathrm{Type}=\mathrm{Ham})$ $P( \mathrm{Type} = \mathrm{Spam} | \mathrm{Words}) = \frac{P(\mathrm{Type}=\mathrm{Spam})}{P(\mathrm{Words})} \prod_{\mathrm{All}~\mathrm{words}} P(\mathrm{Word}|\mathrm{Type}=\mathrm{Spam})$ You will need to use log of the probabilities to prevent underflow/precision issues; apply log to both sides of the equation. Notice that P(words) is the same in both formulas, so you can omit it (set term to 1).

## Part 1: Unigram Model

Before starting: Make sure you install the nltk package with 'pip install nltk' and/or 'pip3 install nltk', depending on which Python version you plan on using (we suggest you use Python 3.8 or 3.9). Otherwise, the provided code will not run. More information about the package can be found here
• Training Phase: Use the training set to build a bag of words model using the emails. Note that you will already be provided with the labels (ham or spam email) for the training set and the training set is already pre-processed for you, such that the training set is a list of lists of words (each list of words contains all the words in one email). The purpose of the training set is to help you calculate $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{ham})$$ and $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{spam})$$ during the testing (development) phase.

Hint: Think about how you could use the training data to help you calculate $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{ham})$$ and $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{spam})$$ during the development (testing) phase. Note that $$P(\mathrm{Word = tiger}|\mathrm{Type}=\mathrm{ham})$$ is the probability of encountering the word "tiger" in an email given that the email is not spam. After the training phase, you should be able to compute $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{ham})$$ and $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{spam})$$ for any word. Also, look into using the Counter data structure to make things easier for you coding-wise.

• Development Phase: In the development phase, you will calculate the $$P(\mathrm{Type}=\mathrm{ham}|\mathrm{Words})$$ and $$P(\mathrm{Type}=\mathrm{spam}|\mathrm{Words})$$ for each email in the development set. You will classify each email in the development set as a ham or spam email depending on which posterior probability is of higher value. You should return a list containing labels for each of the emails in the development set (label order should be the same as the document order in the given development set, so we can grade correctly). Note that your code should use only the training set to learn the individual probabilities. Do not use the development data or any external sources of information.

Hint: Note that the prior probabilities will already be known (since you will specify the positive prior probability yourself when you run the code) and remember that you can simply omit $$P(\mathrm{words})$$ by setting it to 1. Then, your only remaining task is: for each document in the development set, calculate $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{ham})$$ and $$P(\mathrm{Word}|\mathrm{Type}=\mathrm{spam})$$ for each of the words. After that, you will be able to compute $$P(\mathrm{Type}=\mathrm{ham}|\mathrm{Words})$$ and $$P(\mathrm{Type}=\mathrm{spam}|\mathrm{Words})$$ for each email in the development set using the formulas above.

Important Note: You will need to make sure you smooth the likelihoods to prevent zero probabilities. In order to accomplish this task, use Laplace smoothing. This is where the Laplace smoothing parameter will come into play. You can use the following formula for Laplace smoothing when calculating the likelihood probabilities: $$Likelihood = \frac{count(x) + k}{N + k|X|}$$ where x is a type in X, |X| is the number of types, k is the laplace smoothing parameter, and N is the total number of tokens in X. We believe this equation will become clearer once you write out the posterior probabilities without Laplace smoothing first. In other words, we suggest that you first come up with equations for the posterior probabilities without Laplace smoothing, and then examine the formula above to see how to incorporate Laplace smoothing for your purposes.

### Optional: How to change your flag

We provide some flags for you to play around when running your code (for help on how to run the code, see here).

For example, you can use the --lower_case flag to cast all words into lower case. You can also tune the laplace smoothing parameter by using the --laplace flag. You can also tune the --stemming and --pos_prior parameters. You should be able to boost the model performance up a bit by tuning these parameters. However, note that when we grade your MP, we will use our own flags (stemming, lower_case, pos_prior, laplace).

## Part 2: Unigram and Bigram Models

For Part 2, you will implement the naive Bayes algorithm over a bigram model (as opposed to a unigram model like in Part 1). Then, you will combine the bigram model and the unigram model into a mixture model defined with parameter $$\lambda$$: $(1-\lambda)P(Y) \prod_{i=1}^n P(w_i|Y) + \lambda P(Y) \prod_{i=1}^m P(b_i|Y)$ Please note that you would still have to use log to prevent underflow/precision issues. You can choose to find the best parameter $$\lambda$$ that gives the highest classification accuracy. There are additional parameters that you can tune (for more details, see here). However, I should note that you can get away with not tuning these parameters at all, since the autograder will choose these values for you when you grade. So feel free to play around with these parameters, but in the end, the hyperparameters you choose when you run on your local machine won't matter in grading.

You might need to use an alternate formulation where you interpolate the probabilities after taking the log instead of before. If you are falling short of the accuracy levels based on your overall implementation, you should try this version as well. $(1-\lambda)(log(P(Y)) + \sum_{i=1}^n log(P(w_i|Y))) + (\lambda)(log(P(Y)) + \sum_{i=1}^m log(P(b_i|Y)))$

## Part 3: Extra credit

Many of you may have heard of tf-idf, which is short for term frequency-inverse document frequency. tf-idf is used in information retrieval and text mining applications to determine how important a word is to a document within a collection of documents. The tf (term-frequency) term determines how frequently a word appears in a document, and the idf (inverse document frequency) term determines how rare (and thus perhaps more informative) a word is in the collection. A high tf-idf score might indicate that a high term frequency in a particular document and/or a low document frequency (low number of documents that contains the given word).

Like a bag-of-words, tf-idf can be used in classifying documents. For the extra credit portion, you will implement tf-idf on the same dataset as Parts 1 and 2. However, since the extra credit portion is worth only 10%, we will ask you to complete an easier task than in Parts 1 and 2. For this part, instead of classifying the documents, you will find the word with the highest tf-idf value from each of the documents in the development set. Your task is to return a list of these words (with the highest tf-idf values).

For this MP, you should treat the entire training dataset (both ham and spam documents) as the collection of documents from which to calculate your idf term. Your idf term for any given word should be calculated based on only the training dataset, not the development dataset. Meanwhile, you should calculate your tf term based on the term frequency in the document you are evaluating from the development set. Please follow this methodology to obtain results consistent with the autograder.

There are multiple ways in which tf-idf gets implemented in practice. For consistency, we will use the following formula to compute tf-idf for any given word:

$tf\text{-}idf = \frac{(\text{# of times word w appears in doc. A)}}{(\text{total # of words in doc. A})} \cdot \log{(\frac{\text{total # of docs in train set}}{\text{1 + # of docs in train set containing word w}})}$

If there are terms with the same tf-idf value within the same document, choose the first term for tie-breaking.

Sidenote: A cursory look at the words would show you the importance of governing the outputs of such methods before deploying them on products, but that's beyond the scope of this assignment. Interested students could look up literature on bias and fairness in AI.

## Provided Code Skeleton

We have provided (tar, zip) all the code to get you started on your MP.

Note that the only files you will ever have to modify are naive_bayes.py for Part 1 and naive_bayes_mixture.py for Part 2.

### Files Used for Part 1

• reader.py - This file is responsible for reading in the data set. It takes in all of the emails, splits each of those emails into a list of words, stems them if you used the --stemming flag, and then stores all of the lists of words into a larger list (so that each individual list of words corresponds with a single email). Note that this file is used for both parts of the assignment (Part 1 and Part 2).

• mp2.py - This is the main file that starts the program for Part 1, and computes the accuracy, precision, recall, and F1-score using your implementation of naive Bayes.

• naive_bayes.py - This is the file where you will be doing all of your work for Part 1. The function naiveBayes() takes as input the training data, training labels, development set, smoothing parameter, and positive prior probability. The training data provided is the output from reader.py. The training labels is the list of labels corresponding to each email in the training data. The development set is the list of emails that you are going to test your implementation on. The smoothing parameter is the laplace smoothing parameter you specified with --laplace (it is 1 by default). The positive prior probability is a value between 0 and 1 you specified with --pos_prior. You will have naiveBayes() output the predicted labels for the development set from your Naive Bayes model.
Do not modify the provided code. You will only have to modify naive_bayes.py for Part 1. Here is an example of how you would run your code for Part 1 from your terminal/shell:

python3 mp2.py --training data/spam_data/train --development data/spam_data/dev --stemming False --lower_case False --laplace 1.0 --pos_prior 0.5

Note that you can and should change the parameters as necessary. To understand more about how to run the MP for Part 1, run python3 mp2.py -h in your terminal.

### Files Used for Part 2

• reader.py - This is the same file used in Part 1 (see above for description for this file).

• mp2_mixture.py - This is the main file that starts the program for Part 2, and computes the accuracy, precision, recall, and F1-score using your implementation for Part 2 (mixture of unigram and bigram models).

• naive_bayes_mixture.py - This is the file where you will be doing all of your work for Part 2. The function naiveBayesMixture() takes as input the training data, training labels, development set, bigram lambda, unigram smoothing parameter, bigram smoothing parameter, and positive prior probability. The training data provided is the output from reader.py. The training labels is the list of labels corresponding to each email in the training data. The development set is the list of emails that you are going to test your implementation on. The lambda and smoothing parameters are all values between 0 and 1 that you specify when you run the program. The positive prior probability is a value between 0 and 1 you specified with --pos_prior. You will have naiveBayesMixture() output the predicted labels for the development set from your new Naive Bayes model.
Do not modify the provided code. You will only have to modify naive_bayes_mixture.py for Part 2. Here is an example of how you would run your code for Part 2 from your terminal/shell:

python3 mp2_mixture.py --training data/spam_data/train --development data/spam_data/dev --stemming False --lower_case True --bigram_lambda=0.5 --unigram_smoothing=0.1 --bigram_smoothing=0.1 --pos_prior 0.5

Note that you can and should change the parameters as necessary. To understand more about how to run the MP for Part 2, run python3 mp2_mixture.py -h in your terminal.

This file, similar to MP1, is for you to be able to evaluate whether or not you have implemented the unigram and mixture models correctly. This file consists of three fundamental tests:

• It checks if your mixture model approach is correct (whether or not your model is actually a mixture of unigram and bigram models), using the data located under bigram_check
• It checks if your unigram model is able to reach the correct accuracy threshold for the development set provided
• It checks if your mixture model is able to reach the correct accuracy threshold for the development set provided
Unlike the grade.py file provided to you in MP1, this version provides you with only a subset of the tests that your code will undergo on gradescope. This file should at least give you a general idea on whether or not your implementations are correct. Your code will go through more tests on datasets not provided to you once it is submitted to gradescope. Here is an example of how you would run grade.py from your terminal/shell:

### Files Used for Extra Credit

• reader.py - This is the same file used in Part 1 and Part 2 (see above for description for this file).
• mp2_tf_idf.py - This is the main file that runs the program for the Extra Credit Part.
• tf_idf.py - This is the file where you will be doing all of your work for the Extra Credit Part. The function compute_tf_idf() takes as input the training data, training labels, and development set. The training data provided is the output from reader.py. The training labels is the list of labels corresponding to each email in the training data. The development set is the list of emails from which you will extract the words with the highest tf-idf values. You should return a list (with size equal to that of dev_set) that contains words from the development set with the highest tf-idf values. You should choose the word with the highest tf-idf value from each document in the development set.

Do not modify the provided code. You will only have to modify tf_idf.py for the Extra Credit Part. Here is an example of how you would run your code for the Extra Credit Part from your terminal/shell:

python3 mp2_tf_idf.py --training data/spam_data/train --development data/spam_data/dev

## What to Submit

Submit only your naive_bayes.py file and naive_bayes_mixture.py file to Gradescope, under the assignment called MP2: Naive Bayes.

If you completed the extra credit portion, submit only the tf_idf.py file to Gradescope, under the assignment called MP2: Naive Bayes Extra Credit