Commit a61be66a authored by Zeynep Hakguder

Naive Bayes

parent 86912ed1
%% Cell type:markdown id: tags:
# Naive Bayes Spam Classifier
In this part of the assignment, we will
* implement a Naive Bayes spam classifier
* address the sparse data problem with **pseudocounts** (the **$m$-estimate**)
A skeleton of a general supervised learning model is provided in "model.ipynb"; you won't need to implement anything in it for this part of homework 2. A confusion matrix implementation is also provided for you in "model.ipynb".
### Note:
You are not required to follow this exact template. You can change what parameters your functions take or partition the tasks across functions differently. However, make sure there are outputs and implementation for items listed in the rubric for each task. Also, indicate in code with comments which task you are attempting.
%% Cell type:markdown id: tags:
# GRADING
You will be graded on parts that are marked with **\#TODO** comments. Read the comments in the code to make sure you don't miss any.
### Mandatory for 478 & 878:
| | Tasks | 478 | 878 |
|---|----------------------------|-----|-----|
| 1 | Implement `fit` method | 25 | 25 |
| 2 | Implement `predict` method | 25 | 25 |
Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There are a total of 50 points in this part of assignment 2 for both 478 and 878 students.
%% Cell type:code id: tags:
``` python
import numpy as np
import json
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
We will use the Enron spam/ham dataset for spam filtering. The emails are already tokenized. The code below reads in the processed data. There are 33,702 emails: 17,157 spam and 16,545 ham.
**Please do not change the order of the test indices as you'll be graded on results for the first 5 test examples.**
%% Cell type:code id: tags:
``` python
# load tokenized email texts
with open('../data/enron_text.json') as f:
    X = json.load(f)
# load email labels
with open('../data/enron_label.json') as f:
    Y = json.load(f)
# load train/test split indices
with open('../data/enron_split.json') as f:
    indices = json.load(f)
indices.keys()
```
%% Output
dict_keys(['training', 'test'])
%% Cell type:markdown id: tags:
X is a list of lists; each of its 33,702 entries corresponds to a tokenized email. Y is a list; each entry is the label for the email at the corresponding position in X. Below are 5 random emails and their corresponding labels.
%% Cell type:code id: tags:
``` python
# number of emails
size = len(Y)
# randomly select and print some emails
ind_list = np.random.choice(size, 5)
for ind in ind_list:
    print('Tokenized text: {}'.format(X[ind]))
    print('Label: {}'.format(Y[ind]))
```
%% Output
Tokenized text: ['subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'tim', 'belden', 's', 'office', 'referred', 'me', 'to', 'you', 'has', 'grant', 'masson', 'been', 'replaced', 'original', 'message', 'from', 'vince', 'j', 'kaminski', 'enron', 'com', 'mailto', 'vince', 'j', 'kaminski', 'enron', 'com', 'sent', 'monday', 'november', '06', '2000', '11', '00', 'am', 'to', 'niam', 'infocastinc', 'com', 'subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'nia', 'please', 'contact', 'tim', 'belden', 'in', 'our', 'portland', 'office', 'his', 'phone', 'number', 'is', '503', '464', '3820', 'vince', 'nia', 'mansell', 'on', '11', '03', '2000', '12', '46', '52', 'pm', 'to', 'vkamins', 'enron', 'com', 'cc', 'subject', 'power', 'crisis', 'in', 'the', 'west', 'dear', 'vince', 'i', 'spoke', 'with', 'you', 'briefly', 'yesterday', 'regarding', 'grant', 'masson', 'you', 'informed', 'me', 'that', 'he', 'is', 'no', 'longer', 'an', 'enron', 'employee', 'i', 'have', 'also', 'been', 'informed', 'that', 'grant', 'has', 'not', 'yet', 'been', 'replaced', 'i', 'am', 'inquiring', 'because', 'infocast', 'would', 'like', 'to', 'have', 'an', 'enron', 'representative', 'speak', 'at', 'an', 'upcoming', 'conference', 'entitled', 'power', 'crisis', 'in', 'the', 'west', 'status', 'it', 'is', 'certainly', 'going', 'to', 'be', 'an', 'exciting', 'conference', 'due', 'to', 'all', 'of', 'the', 'controversy', 'surrounding', 'the', 'situation', 'in', 'san', 'diego', 'kind', 'regards', 'nia', 'mansell', 'infocast', 'conference', 'manager', '818', '888', '4445', 'ext', '45', '818', '888', '4440', 'fax', 'niam', 'com', 'see', 'attached', 'file', 'power', 'crisis', 'in', 'the', 'west', 'invite', 'doc', '']
Label: 0
Tokenized text: ['subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'fyi', 'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '18', '2001', '03', '07', 'pm', 'angela', 'dorsey', 'on', '01', '18', '2001', '02', '53', '59', 'pm', 'to', 'cc', 'subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'karen', 'thanks', 'for', 'the', 'extra', 'support', 'in', 'getting', 'the', 'word', 'out', 'i', 've', 'had', 'a', 'couple', 'rsvp', 's', 'from', 'enron', 'sincerely', 'angela', 'original', 'message', 'from', 'karen', 'marshall', 'enron', 'com', 'mailto', 'karen', 'marshall', 'enron', 'com', 'sent', 'wednesday', 'january', '17', '2001', '7', '59', 'pm', 'to', 'david', 'haug', 'enron', 'com', 'gary', 'hickerson', 'enron', 'com', 'cchilde', 'enron', 'com', 'thomas', 'suffield', 'enron', 'com', 'ben', 'f', 'glisan', 'enron', 'com', 'ermes', 'melinchon', 'enron', 'com', 'hal', 'elrod', 'enron', 'com', 'clay', 'spears', 'enron', 'com', 'kelly', 'mahmoud', 'enron', 'com', 'ellen', 'fowler', 'enron', 'com', 'kevin', 'kuykendall', 'enron', 'com', 'fred', 'mitro', 'enron', 'com', 'kyle', 'kettler', 'enron', 'com', 'jeff', 'bartlett', 'enron', 'com', 'paul', 'j', 'broderick', 'enron', 'com', 'john', 'house', 'enron', 'com', 'george', 'mccormick', 'enron', 'com', 'guido', 'caranti', 'enron', 'com', 'ken', 'sissingh', 'enron', 'com', 'gwynn', 'gorsuch', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'shawn', 'cumberland', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'paul', 'burkhart', 'enron', 'com', 'garrett', 
'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'brian', 'kerrigan', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'garrett', 'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'christie', 'patrick', 'enron', 'com', 'michael', 'b', 'rosen', 'enron', 'com', 'cindy', 'derecskey', 'enron', 'com', 'cc', 'elyse', 'kalmans', 'enron', 'com', 'richard', 'causey', 'enron', 'com', 'sally', 'beck', 'enron', 'com', 'vince', 'j', 'kaminski', 'enron', 'com', 'jeffrey', 'a', 'shankman', 'enron', 'com', 'angela', 'dorsey', 'subject', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'the', '500', 'registration', 'fee', 'is', 'waived', 'for', 'any', 'enron', 'employee', 'who', 'wishes', 'to', 'attend', 'this', 'conference', 'because', 'of', 'our', 'relationship', 'with', 'the', 'school', 'please', 'forward', 'this', 'information', 'to', 'your', 'managers', 'and', 'staff', 'members', 'who', 'would', 'benefit', 'from', 'participating', 'in', 'this', 'important', 'conference', 'note', 'vince', 'kaminski', 'is', 'a', 'panellist', 'for', 'the', 'risk', 'management', 'session', '3', 'please', 'note', 'the', 'deadline', 'for', 'rsvp', 'hotel', 'reservations', 'is', 'monday', 'january', '22', 'nd', 'don', 't', 'miss', 'this', 'opportunity', 'should', 'you', 'have', 'any', 'questions', 'please', 'feel', 'free', 'to', 'contact', 'me', 'at', 'ext', '37632', 'karen', 
'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '11', '2001', '07', '38', 'pm', 'angela', 'dorsey', 'on', '01', '10', '2001', '03', '06', '18', 'pm', 'to', 'angela', 'dorsey', 'cc', 'ehud', 'ronn', 'sheridan', 'titman', 'e', 'mail', 'subject', 'invitation', 'to', '2001', 'energy', 'finance', 'conference', 'the', 'university', 'of', 'texas', 'at', 'austin', 'colleagues', 'and', 'friends', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'cefer', 'happy', 'new', 'year', 'hope', 'you', 'all', 'had', 'a', 'wonderful', 'holiday', 'season', 'on', 'behalf', 'of', 'the', 'university', 'of', 'texas', 'finance', 'department', 'and', 'cefer', 'we', 'would', 'like', 'to', 'cordially', 'invite', 'you', 'to', 'attend', 'our', '2001', 'energy', 'finance', 'conference', 'austin', 'texas', 'february', '22', '23', '2001', 'hosted', 'by', 'the', 'university', 'of', 'texas', 'finance', 'department', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'dr', 'ehud', 'i', 'ronn', 'and', 'dr', 'sheridan', 'titman', 'are', 'currently', 'in', 'the', 'process', 'of', 'finalizing', 'the', 'details', 'of', 'the', 'conference', 'agenda', 'we', 'have', 'listed', 'the', 'agenda', 'outline', 'below', 'to', 'assist', 'you', 'in', 'your', 'travel', 'planning', 'each', 'conference', 'session', 'will', 'be', 'composed', 'of', 'a', 'panel', 'discussion', 'between', '3', '4', 'guest', 'speakers', 'on', 'the', 'designated', 'topic', 'as', 'supporters', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'representatives', 'of', 'our', 'trustee', 'corporations', 'enron', 'el', 'paso', 'reliant', 'conoco', 'and', 'southern', 'will', 'have', 'the', '500', 'conference', 'fee', 'waived', 'the', 'conference', 'package', 'includes', 'thursday', 'evening', 's', 'cocktails', 'dinner', 'and', 'hotel', 'ut', 'shuttle', 'service', 'as', 'well', 'as', 'friday', 's', 'conference', 'meals', 'session', 'materials', 
'and', 'shuttle', 'service', 'travel', 'to', 'austin', 'and', 'hotel', 'reservations', 'are', 'each', 'participant', 's', 'responsibility', 'a', 'limited', 'number', 'of', 'hotel', 'rooms', 'are', 'being', 'tentatively', 'held', 'at', 'the', 'radisson', 'hotel', 'on', 'town', 'lake', 'under', 'the', 'group', 'name', 'university', 'of', 'texas', 'finance', 'department', 'for', 'the', 'nights', 'of', 'thursday', '2', '22', '01', 'and', 'friday', '2', '23', '01', 'the', 'latter', 'evening', 'for', 'those', 'who', 'choose', 'to', 'stay', 'in', 'austin', 'after', 'the', 'conference', 's', 'conclusion', 'to', 'guarantee', 'room', 'reservations', 'you', 'will', 'need', 'to', 'contact', 'the', 'radisson', 'hotel', 'at', '512', '478', '9611', 'no', 'later', 'than', 'monday', 'january', '22', 'nd', 'and', 'make', 'your', 'reservations', 'with', 'a', 'credit', 'card', 'please', 'let', 'me', 'know', 'when', 'you', 'have', 'made', 'those', 'arrangements', 'so', 'that', 'i', 'can', 'make', 'sure', 'the', 'radisson', 'gives', 'you', 'the', 'special', 'room', 'rate', 'of', '129', 'night', 'please', 'rsvp', 'your', 'interest', 'in', 'attending', 'this', 'conference', 'no', 'later', 'than', 'january', '22', 'nd', 'to', 'angela', 'dorsey', 'bus', 'utexas', 'edu', 'or', '512', '232', '7386', 'as', 'seating', 'availability', 'is', 'limited', 'please', 'feel', 'free', 'to', 'extend', 'this', 'invitation', 'to', 'your', 'colleagues', 'who', 'might', 'be', 'interested', 'in', 'attending', 'this', 'conference', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'program', 'of', 'the', '2001', 'energy', 'finance', 'conference', 'february', '22', '23', '2001', 'thursday', 'feb', '22', '3', '00', 'p', 'm', 'reserved', 'rooms', 'at', 'the', 'radisson', 'hotel', 'available', 'for', 'check', 'in', '5', '30', 'p', 'm', 'bus', 'will', 'pick', 'up', 'guests', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', 'club', '6', '00', 'p', 'm', 'cocktails', 'ut', 'club', '9', 
'th', 'floor', '7', '00', 'p', 'm', 'dinner', 'ut', 'club', '8', '00', 'p', 'm', 'keynote', 'speaker', '9', '00', 'p', 'm', 'bus', 'will', 'transport', 'guests', 'back', 'to', 'hotel', 'friday', 'feb', '23', '7', '45', 'a', 'm', 'bus', 'will', 'pick', 'up', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', '8', '30', 'a', 'm', 'session', '1', 'real', 'options', 'panelists', 'jim', 'dyer', 'ut', 'chair', 'sheridan', 'titman', 'ut', 'john', 'mccormack', 'stern', 'stewart', 'co', '10', '00', 'a', 'm', 'coffee', 'break', '10', '15', 'a', 'm', 'session', '2', 'deregulation', 'panelists', 'david', 'eaton', 'ut', 'chair', 'david', 'spence', 'ut', 'jeff', 'sandefer', 'sandefer', 'capital', 'partners', 'ut', 'peter', 'nance', 'teknecon', 'energy', 'risk', 'advisors', '11', '45', 'a', 'm', 'catered', 'lunch', 'keynote', 'speaker', '1', '30', 'p', 'm', 'guest', 'tour', 'eds', 'financial', 'trading', 'technology', 'center', '2', '00', 'p', 'm', 'session', '3', 'risk', 'management', 'panelists', 'keith', 'brown', 'ut', 'chair', 'vince', 'kaminski', 'enron', 'alexander', 'eydeland', 'southern', 'co', 'ehud', 'i', 'ronn', 'ut', '3', '30', 'p', 'm', 'snack', 'break', '3', '45', 'p', 'm', 'session', '4', 'globalization', 'of', 'the', 'energy', 'business', 'panelists', 'laura', 'starks', 'ut', 'chair', 'bob', 'goldman', 'conoco', 'ray', 'hill', 'southern', 'co', '5', '15', 'p', 'm', 'wrap', 'up', '5', '30', 'p', 'm', 'bus', 'picks', 'up', 'for', 'transport', 'to', 'airport', 'dinner', '6', '30', 'p', 'm', 'working', 'dinner', 'for', 'senior', 'officers', 'of', 'energy', 'finance', 'center', 'trustees', 'we', 'have', 'made', 'arrangements', 'to', 'provide', 'shuttle', 'service', 'between', 'the', 'radisson', 'hotel', 'and', 'ut', 'during', 'the', 'conference', 'however', 'if', 'you', 'choose', 'to', 'stay', 'at', 'an', 'alternative', 'hotel', 'then', 'transportation', 'to', 'conference', 'events', 'will', 'become', 'your', 'responsibility', 'angela', 'dorsey', 'assistant', 
'director', 'center', 'for', 'energy', 'finance', 'education', 'research', 'the', 'university', 'of', 'texas', 'at', 'austin', 'department', 'of', 'finance', 'cba', '6', '222', 'austin', 'tx', '78712', 'angela', 'dorsey', 'bus', 'utexas', 'edu', '']
Label: 0
Tokenized text: ['subject', 'interview', 'jaesoo', 'lew', '10', '25', '00', 'attached', 'please', 'find', 'the', 'resume', 'interview', 'schedule', 'and', 'evaluation', 'form', 'for', 'jaesoo', 'lew', 'jaesoo', 'will', 'be', 'interviewing', 'with', 'vince', 'kaminski', 's', 'group', 'on', 'an', 'exploratory', 'basis', 'on', 'october', '25', '2000', 'please', 'contact', 'me', 'with', 'any', 'comments', 'or', 'concerns', 'thank', 'you', 'cheryl', 'arguijo', 'ena', 'recruiting', '713', '345', '4016']
Label: 0
Tokenized text: ['subject', 'custom', 'marketing', 'to', 'webmaster', 'ezmlm', 'org', 'email', 'is', 'the', 'best', 'promote', 'tool', 'we', 'offer', 'online', 'marketing', 'with', 'quality', 'service', '1', 'target', 'email', 'list', 'we', 'can', 'provide', 'target', 'email', 'list', 'you', 'need', 'which', 'are', 'compiled', 'only', 'on', 'your', 'order', 'we', 'will', 'customize', 'your', 'client', 'email', 'list', 'we', 'have', 'millions', 'of', 'lists', 'in', 'a', 'wide', 'variety', 'of', 'categories', '2', 'send', 'out', 'target', 'list', 'for', 'you', 'we', 'can', 'send', 'your', 'email', 'message', 'to', 'your', 'target', 'clients', 'we', 'will', 'customize', 'your', 'email', 'list', 'and', 'send', 'your', 'message', 'for', 'you', 'our', 'site', 'www', 'marketingforus', 'com', 'we', 'also', 'offer', 'web', 'hosting', 'mail', 'server', 'regards', 'jason', 'marketing', 'support', 'sales', 'marketingforus', 'com', 'no', 'thanks', 'byebye', 'msn', 'com', 'subject', 'webmaster', 'ezmlm', 'org']
Label: 1
Tokenized text: ['subject', 'popular', 'software', 'at', 'low', 'low', 'prices', 'alaina', 'windows', 'xp', 'professional', '2002', '50', 'adobe', 'photoshop', '7', '0', '50', 'microsoft', 'office', 'xp', 'professional', '2002', '50', 'corel', 'draw', 'graphics', 'suite', '11', '50', '']
Label: 1
%% Cell type:markdown id: tags:
## TASK 1: Implement `fit`
Implement the `fit` and `predict` methods for Naive Bayes. Use the $m$-estimate to handle words that never occur in a class (with $m = 1$ this is also called **Laplace smoothing**). In general, $m$ values should be small; we'll use $m$ = 1. We'll use log probabilities to avoid underflow.
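As a toy illustration of the smoothing formula (made-up counts, not the assignment solution), the $m$-estimate replaces the maximum-likelihood word probability $\frac{\text{count}(w, c)}{\text{total}(c)}$ with $\frac{\text{count}(w, c) + m}{\text{total}(c) + m\,|V|}$, where $|V|$ is the vocabulary size:

``` python
import numpy as np

# Toy counts: the word "offer" appears 3 times among 100 spam words
# and 0 times among 120 ham words; the vocabulary has 50 distinct words.
m = 1
vocab_size = 50

# Unsmoothed, P("offer" | ham) = 0/120 = 0, so log(0) = -inf would wipe out
# the posterior of any email containing "offer". The m-estimate avoids this:
p_spam = (3 + m) / (100 + m * vocab_size)   # 4/150
p_ham = (0 + m) / (120 + m * vocab_size)    # 1/170 instead of 0

print(np.log(p_spam), np.log(p_ham))
```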
%% Cell type:code id: tags:
``` python
class Naive_Bayes(Model):

    def __init__(self, m):
        '''
        Args:
            m: int
                Specifies the smoothing parameter
        '''
        self.m = m

    def fit(self, X, Y):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
            Y: ndarray
                1D array of true labels. 1: spam, 0: ham
        '''
        #TODO
        # Replace Nones, empty lists and dictionaries below

        # List containing all distinct words in all emails.
        # A list might not be the best data structure for obtaining
        # the vocabulary. Use a temporary, more efficient data structure,
        # then populate self.vocabulary.
        self.vocabulary = []

        # find *log* class prior probabilities
        self.prior = {'spam': None, 'ham': None}

        # find the number of words (counting repeats) summed across all emails in a class
        self.total_count = {'spam': None, 'ham': None}

        # find the count of each word summed across all emails in a class
        # populate self.word_counts
        # self.word_counts['spam'] is a dict with words as keys.
        self.word_counts = {'spam': {}, 'ham': {}}

    def predict(self, X):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
        Returns:
            probs: ndarray
                mx2 array containing unnormalized log posteriors for spam and ham (for grading purposes)
            preds: ndarray
                1D binary array containing predicted labels
        '''
        preds = []
        probs = []
        #TODO
        # use the attributes calculated in fit to compute unnormalized class
        # posterior probabilities and predicted labels
        raise NotImplementedError
        return np.array(probs), np.array(preds)
```
%% Cell type:markdown id: tags:
### Rubric:
* correct vocabulary length +5,+5
* correct log class priors +10, +10
* correct word counts for the 5 most frequent words in each class +10, +10
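The skeleton's comment notes that a plain list is not ideal for collecting the vocabulary. As an illustrative sketch on toy data (one possible approach, not a required one), a set gives constant-time membership checks while accumulating, and can be converted to a list at the end:

``` python
# Toy emails; a set avoids the O(n) duplicate checks a list would need.
emails = [['hello', 'world'], ['hello', 'spam', 'offer']]
vocab_set = set()
for email in emails:
    vocab_set.update(email)
vocabulary = sorted(vocab_set)  # convert to a list, with a stable order
print(vocabulary)  # ['hello', 'offer', 'spam', 'world']
```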
%% Cell type:markdown id: tags:
### Test `fit`
%% Cell type:code id: tags:
``` python
# initialize the model
my_model = Naive_Bayes(...)
# pass the training emails and labels
my_model.fit(X=..., Y=...)
# display the 5 most frequent words in each class
for cl in ['ham', 'spam']:
    srt = sorted(my_model.word_counts[cl].items(), key=lambda x: x[1], reverse=True)
    print('\n{} log prior: {}'.format(cl, my_model.prior[cl]))
    print('5 most frequent words:')
    print(srt[:5])
print('\nVocabulary has {} distinct words'.format(len(my_model.vocabulary)))
```
%% Cell type:markdown id: tags:
## TASK 2: Implement `predict`
Print the unnormalized log posteriors for the first five test examples. We can use the "conf_matrix" function to see how the error is distributed.
### Rubric:
* Correct unnormalized log posteriors +20, +20
* Correct confusion matrix +5, +5
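As a toy sketch of the quantity being computed (with made-up log likelihoods, not the assignment's values), an unnormalized log posterior is the log class prior plus the sum of smoothed log likelihoods of the email's words, and the prediction is the class with the larger value:

``` python
import numpy as np

log_prior = {'spam': np.log(0.5), 'ham': np.log(0.5)}
# Hypothetical smoothed log likelihoods for two words
log_lik = {'spam': {'free': np.log(0.04), 'offer': np.log(0.03)},
           'ham':  {'free': np.log(0.01), 'offer': np.log(0.005)}}

email = ['free', 'offer']
# log P(c) + sum over words of log P(w | c), left unnormalized
posterior = {c: log_prior[c] + sum(log_lik[c][w] for w in email)
             for c in ['spam', 'ham']}
pred = 1 if posterior['spam'] > posterior['ham'] else 0
print(posterior, pred)
```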
### Test `predict`
%% Cell type:code id: tags:
``` python
probs, preds = my_model.predict(X=...)
print('\nUnnormalized log posteriors of first 5 test examples:')
print(probs[:5])
tp, tn, fp, fn = conf_matrix(true=..., pred=preds)
print('tp: {}, tn: {}, fp: {}, fn: {}'.format(tp, tn, fp, fn))
```