
Naive Bayes Spam Classifier

In this part of the assignment, we will

  • implement a Naive Bayes spam classifier
    • address the sparse data problem with pseudocounts (the $m$-estimate)

A skeleton of a general supervised learning model is provided in "model.ipynb"; we won't add any implementations to it for this part of homework 3. A confusion matrix implementation is also provided for you in "model.ipynb".

Note:

You are not required to follow this exact template. You can change what parameters your functions take or partition the tasks across functions differently. However, make sure there are outputs and implementations for the items listed in the rubric for each task. Also, indicate with comments in your code which task you are attempting.

GRADING

You will be graded on parts that are marked with #TODO comments. Read the comments in the code to make sure you don't miss any.

Mandatory for 478 & 878:

Task                          478  878
1. Implement fit method       25   25
2. Implement predict method   25   25

Points are broken down further in the Rubric sections below. The first score is for 478 students, the second for 878 students. There are a total of 50 points in this part of the assignment for both 478 and 878 students.

In [1]:
import numpy as np
import json
%run 'model.ipynb'

We will use the Enron spam/ham dataset for spam filtering. The emails are already tokenized. The code below reads in the processed data. There are 33,702 emails: 17,157 spam and 16,545 ham.

Please do not change the order of the test indices as you'll be graded on results for the first 5 test examples.

In [2]:
# load tokenized email texts
with open('../data/enron_text.json') as f:
    X = json.load(f)
# load email labels
with open('../data/enron_label.json') as f:
    Y = json.load(f)
with open('../data/enron_split.json') as f:
    indices = json.load(f)
    
indices.keys()
Out [2]:
dict_keys(['training', 'test'])
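For example, the training portion can be materialized as follows (a short sketch; it assumes, as the key names suggest, that the values are integer positions into X and Y, and the names X_train and Y_train are ours, not part of the template):

# build the training split from the provided indices
X_train = [X[i] for i in indices['training']]
Y_train = np.array([Y[i] for i in indices['training']])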

X is a list of lists. Each of its 33,702 entries corresponds to a tokenized email. Y is a list; each entry is the label for the tokenized email at the corresponding position in X. Below are 5 random emails and their corresponding labels.

In [15]:
# number of emails
size = len(Y)
# randomly select and print some emails
ind_list = np.random.choice(size, 5)
for ind in ind_list:
    print('Tokenized text: {}'.format(X[ind]))
    print('Label: {}'.format(Y[ind]))
Out [15]:
Tokenized text: ['subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'tim', 'belden', 's', 'office', 'referred', 'me', 'to', 'you', 'has', 'grant', 'masson', 'been', 'replaced', 'original', 'message', 'from', 'vince', 'j', 'kaminski', 'enron', 'com', 'mailto', 'vince', 'j', 'kaminski', 'enron', 'com', 'sent', 'monday', 'november', '06', '2000', '11', '00', 'am', 'to', 'niam', 'infocastinc', 'com', 'subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'nia', 'please', 'contact', 'tim', 'belden', 'in', 'our', 'portland', 'office', 'his', 'phone', 'number', 'is', '503', '464', '3820', 'vince', 'nia', 'mansell', 'on', '11', '03', '2000', '12', '46', '52', 'pm', 'to', 'vkamins', 'enron', 'com', 'cc', 'subject', 'power', 'crisis', 'in', 'the', 'west', 'dear', 'vince', 'i', 'spoke', 'with', 'you', 'briefly', 'yesterday', 'regarding', 'grant', 'masson', 'you', 'informed', 'me', 'that', 'he', 'is', 'no', 'longer', 'an', 'enron', 'employee', 'i', 'have', 'also', 'been', 'informed', 'that', 'grant', 'has', 'not', 'yet', 'been', 'replaced', 'i', 'am', 'inquiring', 'because', 'infocast', 'would', 'like', 'to', 'have', 'an', 'enron', 'representative', 'speak', 'at', 'an', 'upcoming', 'conference', 'entitled', 'power', 'crisis', 'in', 'the', 'west', 'status', 'it', 'is', 'certainly', 'going', 'to', 'be', 'an', 'exciting', 'conference', 'due', 'to', 'all', 'of', 'the', 'controversy', 'surrounding', 'the', 'situation', 'in', 'san', 'diego', 'kind', 'regards', 'nia', 'mansell', 'infocast', 'conference', 'manager', '818', '888', '4445', 'ext', '45', '818', '888', '4440', 'fax', 'niam', 'com', 'see', 'attached', 'file', 'power', 'crisis', 'in', 'the', 'west', 'invite', 'doc', '']
Label: 0
Tokenized text: ['subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'fyi', 'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '18', '2001', '03', '07', 'pm', 'angela', 'dorsey', 'on', '01', '18', '2001', '02', '53', '59', 'pm', 'to', 'cc', 'subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'karen', 'thanks', 'for', 'the', 'extra', 'support', 'in', 'getting', 'the', 'word', 'out', 'i', 've', 'had', 'a', 'couple', 'rsvp', 's', 'from', 'enron', 'sincerely', 'angela', 'original', 'message', 'from', 'karen', 'marshall', 'enron', 'com', 'mailto', 'karen', 'marshall', 'enron', 'com', 'sent', 'wednesday', 'january', '17', '2001', '7', '59', 'pm', 'to', 'david', 'haug', 'enron', 'com', 'gary', 'hickerson', 'enron', 'com', 'cchilde', 'enron', 'com', 'thomas', 'suffield', 'enron', 'com', 'ben', 'f', 'glisan', 'enron', 'com', 'ermes', 'melinchon', 'enron', 'com', 'hal', 'elrod', 'enron', 'com', 'clay', 'spears', 'enron', 'com', 'kelly', 'mahmoud', 'enron', 'com', 'ellen', 'fowler', 'enron', 'com', 'kevin', 'kuykendall', 'enron', 'com', 'fred', 'mitro', 'enron', 'com', 'kyle', 'kettler', 'enron', 'com', 'jeff', 'bartlett', 'enron', 'com', 'paul', 'j', 'broderick', 'enron', 'com', 'john', 'house', 'enron', 'com', 'george', 'mccormick', 'enron', 'com', 'guido', 'caranti', 'enron', 'com', 'ken', 'sissingh', 'enron', 'com', 'gwynn', 'gorsuch', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'shawn', 'cumberland', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'paul', 'burkhart', 'enron', 'com', 'garrett', 'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'brian', 'kerrigan', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'garrett', 'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'christie', 'patrick', 'enron', 'com', 'michael', 'b', 'rosen', 'enron', 'com', 'cindy', 'derecskey', 'enron', 'com', 'cc', 'elyse', 'kalmans', 'enron', 'com', 'richard', 'causey', 'enron', 'com', 'sally', 'beck', 'enron', 'com', 'vince', 'j', 'kaminski', 'enron', 'com', 'jeffrey', 'a', 'shankman', 'enron', 'com', 'angela', 'dorsey', 'subject', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'the', '500', 'registration', 'fee', 'is', 'waived', 'for', 'any', 'enron', 'employee', 'who', 'wishes', 'to', 'attend', 'this', 'conference', 'because', 'of', 'our', 'relationship', 'with', 'the', 'school', 'please', 'forward', 'this', 'information', 'to', 'your', 'managers', 'and', 'staff', 'members', 'who', 'would', 'benefit', 'from', 
'participating', 'in', 'this', 'important', 'conference', 'note', 'vince', 'kaminski', 'is', 'a', 'panellist', 'for', 'the', 'risk', 'management', 'session', '3', 'please', 'note', 'the', 'deadline', 'for', 'rsvp', 'hotel', 'reservations', 'is', 'monday', 'january', '22', 'nd', 'don', 't', 'miss', 'this', 'opportunity', 'should', 'you', 'have', 'any', 'questions', 'please', 'feel', 'free', 'to', 'contact', 'me', 'at', 'ext', '37632', 'karen', 'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '11', '2001', '07', '38', 'pm', 'angela', 'dorsey', 'on', '01', '10', '2001', '03', '06', '18', 'pm', 'to', 'angela', 'dorsey', 'cc', 'ehud', 'ronn', 'sheridan', 'titman', 'e', 'mail', 'subject', 'invitation', 'to', '2001', 'energy', 'finance', 'conference', 'the', 'university', 'of', 'texas', 'at', 'austin', 'colleagues', 'and', 'friends', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'cefer', 'happy', 'new', 'year', 'hope', 'you', 'all', 'had', 'a', 'wonderful', 'holiday', 'season', 'on', 'behalf', 'of', 'the', 'university', 'of', 'texas', 'finance', 'department', 'and', 'cefer', 'we', 'would', 'like', 'to', 'cordially', 'invite', 'you', 'to', 'attend', 'our', '2001', 'energy', 'finance', 'conference', 'austin', 'texas', 'february', '22', '23', '2001', 'hosted', 'by', 'the', 'university', 'of', 'texas', 'finance', 'department', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'dr', 'ehud', 'i', 'ronn', 'and', 'dr', 'sheridan', 'titman', 'are', 'currently', 'in', 'the', 'process', 'of', 'finalizing', 'the', 'details', 'of', 'the', 'conference', 'agenda', 'we', 'have', 'listed', 'the', 'agenda', 'outline', 'below', 'to', 'assist', 'you', 'in', 'your', 'travel', 'planning', 'each', 'conference', 'session', 'will', 'be', 'composed', 'of', 'a', 'panel', 'discussion', 'between', '3', '4', 'guest', 'speakers', 'on', 'the', 'designated', 'topic', 'as', 'supporters', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'representatives', 'of', 'our', 'trustee', 'corporations', 'enron', 'el', 'paso', 'reliant', 'conoco', 'and', 'southern', 'will', 'have', 'the', '500', 'conference', 'fee', 'waived', 'the', 'conference', 'package', 'includes', 'thursday', 'evening', 's', 'cocktails', 'dinner', 'and', 'hotel', 'ut', 'shuttle', 'service', 'as', 'well', 'as', 'friday', 's', 'conference', 'meals', 'session', 'materials', 'and', 'shuttle', 'service', 'travel', 'to', 'austin', 'and', 'hotel', 'reservations', 'are', 'each', 'participant', 's', 'responsibility', 'a', 'limited', 'number', 'of', 'hotel', 'rooms', 'are', 'being', 'tentatively', 'held', 'at', 'the', 'radisson', 'hotel', 'on', 'town', 'lake', 'under', 'the', 'group', 'name', 'university', 'of', 'texas', 'finance', 'department', 'for', 'the', 'nights', 'of', 'thursday', '2', '22', '01', 'and', 'friday', '2', '23', '01', 'the', 'latter', 'evening', 'for', 'those', 'who', 'choose', 'to', 'stay', 'in', 'austin', 'after', 'the', 'conference', 's', 'conclusion', 'to', 'guarantee', 'room', 'reservations', 'you', 'will', 'need', 'to', 'contact', 'the', 'radisson', 'hotel', 'at', '512', '478', '9611', 'no', 'later', 'than', 'monday', 'january', '22', 'nd', 'and', 'make', 'your', 'reservations', 'with', 'a', 'credit', 'card', 'please', 'let', 'me', 'know', 'when', 'you', 'have', 'made', 'those', 'arrangements', 'so', 'that', 'i', 'can', 'make', 'sure', 'the', 'radisson', 'gives', 'you', 'the', 'special', 'room', 'rate', 'of', '129', 'night', 'please', 'rsvp', 
'your', 'interest', 'in', 'attending', 'this', 'conference', 'no', 'later', 'than', 'january', '22', 'nd', 'to', 'angela', 'dorsey', 'bus', 'utexas', 'edu', 'or', '512', '232', '7386', 'as', 'seating', 'availability', 'is', 'limited', 'please', 'feel', 'free', 'to', 'extend', 'this', 'invitation', 'to', 'your', 'colleagues', 'who', 'might', 'be', 'interested', 'in', 'attending', 'this', 'conference', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'program', 'of', 'the', '2001', 'energy', 'finance', 'conference', 'february', '22', '23', '2001', 'thursday', 'feb', '22', '3', '00', 'p', 'm', 'reserved', 'rooms', 'at', 'the', 'radisson', 'hotel', 'available', 'for', 'check', 'in', '5', '30', 'p', 'm', 'bus', 'will', 'pick', 'up', 'guests', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', 'club', '6', '00', 'p', 'm', 'cocktails', 'ut', 'club', '9', 'th', 'floor', '7', '00', 'p', 'm', 'dinner', 'ut', 'club', '8', '00', 'p', 'm', 'keynote', 'speaker', '9', '00', 'p', 'm', 'bus', 'will', 'transport', 'guests', 'back', 'to', 'hotel', 'friday', 'feb', '23', '7', '45', 'a', 'm', 'bus', 'will', 'pick', 'up', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', '8', '30', 'a', 'm', 'session', '1', 'real', 'options', 'panelists', 'jim', 'dyer', 'ut', 'chair', 'sheridan', 'titman', 'ut', 'john', 'mccormack', 'stern', 'stewart', 'co', '10', '00', 'a', 'm', 'coffee', 'break', '10', '15', 'a', 'm', 'session', '2', 'deregulation', 'panelists', 'david', 'eaton', 'ut', 'chair', 'david', 'spence', 'ut', 'jeff', 'sandefer', 'sandefer', 'capital', 'partners', 'ut', 'peter', 'nance', 'teknecon', 'energy', 'risk', 'advisors', '11', '45', 'a', 'm', 'catered', 'lunch', 'keynote', 'speaker', '1', '30', 'p', 'm', 'guest', 'tour', 'eds', 'financial', 'trading', 'technology', 'center', '2', '00', 'p', 'm', 'session', '3', 'risk', 'management', 'panelists', 'keith', 'brown', 'ut', 'chair', 'vince', 'kaminski', 'enron', 'alexander', 'eydeland', 'southern', 'co', 'ehud', 'i', 'ronn', 'ut', '3', '30', 'p', 'm', 'snack', 'break', '3', '45', 'p', 'm', 'session', '4', 'globalization', 'of', 'the', 'energy', 'business', 'panelists', 'laura', 'starks', 'ut', 'chair', 'bob', 'goldman', 'conoco', 'ray', 'hill', 'southern', 'co', '5', '15', 'p', 'm', 'wrap', 'up', '5', '30', 'p', 'm', 'bus', 'picks', 'up', 'for', 'transport', 'to', 'airport', 'dinner', '6', '30', 'p', 'm', 'working', 'dinner', 'for', 'senior', 'officers', 'of', 'energy', 'finance', 'center', 'trustees', 'we', 'have', 'made', 'arrangements', 'to', 'provide', 'shuttle', 'service', 'between', 'the', 'radisson', 'hotel', 'and', 'ut', 'during', 'the', 'conference', 'however', 'if', 'you', 'choose', 'to', 'stay', 'at', 'an', 'alternative', 'hotel', 'then', 'transportation', 'to', 'conference', 'events', 'will', 'become', 'your', 'responsibility', 'angela', 'dorsey', 'assistant', 'director', 'center', 'for', 'energy', 'finance', 'education', 'research', 'the', 'university', 'of', 'texas', 'at', 'austin', 'department', 'of', 'finance', 'cba', '6', '222', 'austin', 'tx', '78712', 'angela', 'dorsey', 'bus', 'utexas', 'edu', '']
Label: 0
Tokenized text: ['subject', 'interview', 'jaesoo', 'lew', '10', '25', '00', 'attached', 'please', 'find', 'the', 'resume', 'interview', 'schedule', 'and', 'evaluation', 'form', 'for', 'jaesoo', 'lew', 'jaesoo', 'will', 'be', 'interviewing', 'with', 'vince', 'kaminski', 's', 'group', 'on', 'an', 'exploratory', 'basis', 'on', 'october', '25', '2000', 'please', 'contact', 'me', 'with', 'any', 'comments', 'or', 'concerns', 'thank', 'you', 'cheryl', 'arguijo', 'ena', 'recruiting', '713', '345', '4016']
Label: 0
Tokenized text: ['subject', 'custom', 'marketing', 'to', 'webmaster', 'ezmlm', 'org', 'email', 'is', 'the', 'best', 'promote', 'tool', 'we', 'offer', 'online', 'marketing', 'with', 'quality', 'service', '1', 'target', 'email', 'list', 'we', 'can', 'provide', 'target', 'email', 'list', 'you', 'need', 'which', 'are', 'compiled', 'only', 'on', 'your', 'order', 'we', 'will', 'customize', 'your', 'client', 'email', 'list', 'we', 'have', 'millions', 'of', 'lists', 'in', 'a', 'wide', 'variety', 'of', 'categories', '2', 'send', 'out', 'target', 'list', 'for', 'you', 'we', 'can', 'send', 'your', 'email', 'message', 'to', 'your', 'target', 'clients', 'we', 'will', 'customize', 'your', 'email', 'list', 'and', 'send', 'your', 'message', 'for', 'you', 'our', 'site', 'www', 'marketingforus', 'com', 'we', 'also', 'offer', 'web', 'hosting', 'mail', 'server', 'regards', 'jason', 'marketing', 'support', 'sales', 'marketingforus', 'com', 'no', 'thanks', 'byebye', 'msn', 'com', 'subject', 'webmaster', 'ezmlm', 'org']
Label: 1
Tokenized text: ['subject', 'popular', 'software', 'at', 'low', 'low', 'prices', 'alaina', 'windows', 'xp', 'professional', '2002', '50', 'adobe', 'photoshop', '7', '0', '50', 'microsoft', 'office', 'xp', 'professional', '2002', '50', 'corel', 'draw', 'graphics', 'suite', '11', '50', '']
Label: 1

TASK 1: Implement fit

Implement the "fit" and "predict" methods for Naive Bayes. Use the $m$-estimate to address missing attribute values (also called Laplace smoothing when $m = 1$). In general, $m$ values should be small. We'll use $m = 1$. We'll use log probabilities to avoid underflow.
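Concretely, with pseudocount $m$ and vocabulary $V$, the smoothed likelihood of a word $w$ in class $c$ is (standard multinomial Naive Bayes with the $m$-estimate; the notation mirrors the total_count and word_counts attributes in the skeleton below):

$$P(w \mid c) = \frac{\mathrm{word\_counts}[c][w] + m}{\mathrm{total\_count}[c] + m\,|V|}$$

and the unnormalized log posterior of class $c$ for an email $d$ is

$$\log P(c) + \sum_{w \in d} \log P(w \mid c)$$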

class Naive_Bayes(Model):
    def __init__(self, m):
        '''
        Args:
            m: int
                Specifies the smoothing parameter
        '''
        self.m = m
    
    def fit(self, X, Y):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
            Y: ndarray
                1D array of true labels. 1: spam, 0: ham
        '''
        
        #TODO
        
        # Replace Nones, empty lists and dictionaries below
        
        # List containing all distinct words in all emails
        # A list might not be the best data structure for obtaining
        # the vocabulary.
        # Use a temporary more efficient data structure 
        # then populate self.vocabulary.
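        # (illustrative: one option is to collect words in a set while
        #  scanning emails, since sets give O(1) membership tests, and
        #  only convert to a list at the end: self.vocabulary = list(vocab))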
        
        self.vocabulary = []
        
        # find *log* class prior probabilities
        self.prior = {'spam': None, 'ham': None}
        
        # find the number of words (counting repeats) summed across all emails in a class
        self.total_count = {'spam': None, 'ham': None}
        
        # find the number of each word summed across all emails in a class
        # populate self.word_counts
        # self.word_counts['spam'] is a dict with words as keys.
        self.word_counts = {'spam': {}, 'ham': {}}
        
        
    
    def predict(self, X):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
        Returns:            
            probs: ndarray
                mx2 array containing unnormalized log posteriors for spam and ham (for grading purposes)
            preds: ndarray
                1D binary array containing predicted labels
        '''
        preds = []
        probs = []
        
        #TODO
        
        # use the attributes calculated in fit to compute unnormalized class posterior probabilities
        # and predicted labels
        
        raise NotImplementedError
        
        return np.array(probs), np.array(preds)
        

Rubric:

  • correct vocabulary length +5, +5
  • correct log class priors +10, +10
  • correct word counts for the 5 most frequent words in each class +10, +10
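For reference, here is a minimal sketch of the counting logic fit needs, written as a standalone function so it runs on its own; the name nb_fit and the returned structures are illustrative, not part of the provided template:

import numpy as np
from collections import Counter

def nb_fit(X, Y):
    '''Collect Naive Bayes statistics.
    X: list of tokenized emails; Y: labels (1 = spam, 0 = ham).'''
    word_counts = {'spam': Counter(), 'ham': Counter()}
    n_emails = {'spam': 0, 'ham': 0}
    for email, y in zip(X, Y):
        c = 'spam' if y == 1 else 'ham'
        n_emails[c] += 1
        word_counts[c].update(email)  # per-class word frequencies
    # distinct words across both classes
    vocabulary = list(set(word_counts['spam']) | set(word_counts['ham']))
    n_total = n_emails['spam'] + n_emails['ham']
    # log class priors
    prior = {c: np.log(n_emails[c] / n_total) for c in n_emails}
    # number of words in each class, counting repeats
    total_count = {c: sum(word_counts[c].values()) for c in word_counts}
    return vocabulary, prior, total_count, word_counts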

Test fit

# initialize the model
my_model = Naive_Bayes(...)

# pass the training emails and labels
my_model.fit(X=..., Y=...)

# display the most frequent 5 words in both classes
for cl in ['ham', 'spam']:
    srt = sorted(my_model.word_counts[cl].items(), key=lambda x: x[1], reverse=True)
    print('\n{} log prior: {}'.format(cl, my_model.prior[cl]))
    print('5 most frequent words:')
    print(srt[:5])
print('\nVocabulary has {} distinct words'.format(len(my_model.vocabulary)))

TASK 2: Implement predict

Print the unnormalized log posteriors for the first five test examples. We can use the "conf_matrix" function to see how the errors are distributed.

Rubric:

  • Correct unnormalized log posteriors +20, +20
  • Correct confusion matrix +5, +5
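Likewise, a minimal sketch of the posterior computation, applying the $m$-estimate formula from Task 1 to the statistics returned by the nb_fit sketch above (again, nb_predict and its signature are illustrative, and numpy is assumed imported as np, as in the first cell; columns are ordered spam first, ham second, as the docstring specifies):

def nb_predict(X, vocabulary, prior, total_count, word_counts, m=1):
    '''Unnormalized log posteriors and predictions for tokenized emails.'''
    V = len(vocabulary)
    probs, preds = [], []
    for email in X:
        post = {}
        for c in ('spam', 'ham'):
            denom = total_count[c] + m * V
            # log prior plus summed log-likelihoods; words unseen in class c
            # still get the pseudocount m (Counter returns 0 for missing keys)
            post[c] = prior[c] + sum(
                np.log((word_counts[c][w] + m) / denom) for w in email)
        probs.append([post['spam'], post['ham']])
        preds.append(1 if post['spam'] > post['ham'] else 0)
    return np.array(probs), np.array(preds)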

Test predict

probs, preds = my_model.predict(X=...)
print('\nUnnormalized log posteriors of first 5 test examples:')
print(probs[:5])
tp, tn, fp, fn = conf_matrix(true=..., pred=preds)
print('tp: {}, tn: {}, fp: {}, fn: {}'.format(tp, tn, fp, fn))