Commit a61be66a authored by Zeynep Hakguder

Naive Bayes

parent 86912ed1
%% Cell type:markdown id: tags:
# Naive Bayes Spam Classifier
In this part of the assignment, we will
* implement a Naive Bayes spam classifier
* address the sparse data problem with **pseudocounts** (the **$m$-estimate**)
A skeleton of a general supervised learning model is provided in "model.ipynb"; you won't need to implement anything in it for this part of homework 2. A confusion matrix implementation is also provided for you in "model.ipynb".
### Note:
You are not required to follow this exact template. You can change what parameters your functions take or partition the tasks across functions differently. However, make sure there are outputs and implementation for items listed in the rubric for each task. Also, indicate in code with comments which task you are attempting.
%% Cell type:markdown id: tags:
# GRADING
You will be graded on parts that are marked with **\#TODO** comments. Read the comments in the code to make sure you don't miss any.
### Mandatory for 478 & 878:
| | Tasks | 478 | 878 |
|---|----------------------------|-----|-----|
| 1 | Implement `fit` method | 25 | 25 |
| 2 | Implement `predict` method | 25 | 25 |
Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There are a total of 50 points in this part of assignment 2 for both 478 and 878 students.
%% Cell type:code id: tags:
``` python
import numpy as np
import json
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
We will use the Enron spam/ham dataset for spam filtering. The emails are already tokenized. The code below reads in the processed data. There are 33,702 emails: 17,157 spam and 16,545 ham.
**Please do not change the order of the test indices as you'll be graded on results for the first 5 test examples.**
%% Cell type:code id: tags:
``` python
# load tokenized email texts
with open('../data/enron_text.json') as f:
    X = json.load(f)
# load email labels
with open('../data/enron_label.json') as f:
    Y = json.load(f)
# load train/test split indices
with open('../data/enron_split.json') as f:
    indices = json.load(f)
indices.keys()
```
%% Output
dict_keys(['training', 'test'])
%% Cell type:markdown id: tags:
X is a list of lists; each of its 33,702 entries corresponds to a tokenized email. Y is a list; each entry is the label for the email at the corresponding position in X. Below are 5 random emails and their corresponding labels.
%% Cell type:code id: tags:
``` python
# number of emails
size = len(Y)
# randomly select and print some emails
ind_list = np.random.choice(size, 5)
for ind in ind_list:
    print('Tokenized text: {}'.format(X[ind]))
    print('Label: {}'.format(Y[ind]))
```
%% Output
Tokenized text: ['subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'tim', 'belden', 's', 'office', 'referred', 'me', 'to', 'you', 'has', 'grant', 'masson', 'been', 'replaced', 'original', 'message', 'from', 'vince', 'j', 'kaminski', 'enron', 'com', 'mailto', 'vince', 'j', 'kaminski', 'enron', 'com', 'sent', 'monday', 'november', '06', '2000', '11', '00', 'am', 'to', 'niam', 'infocastinc', 'com', 'subject', 're', 'power', 'crisis', 'in', 'the', 'west', 'nia', 'please', 'contact', 'tim', 'belden', 'in', 'our', 'portland', 'office', 'his', 'phone', 'number', 'is', '503', '464', '3820', 'vince', 'nia', 'mansell', 'on', '11', '03', '2000', '12', '46', '52', 'pm', 'to', 'vkamins', 'enron', 'com', 'cc', 'subject', 'power', 'crisis', 'in', 'the', 'west', 'dear', 'vince', 'i', 'spoke', 'with', 'you', 'briefly', 'yesterday', 'regarding', 'grant', 'masson', 'you', 'informed', 'me', 'that', 'he', 'is', 'no', 'longer', 'an', 'enron', 'employee', 'i', 'have', 'also', 'been', 'informed', 'that', 'grant', 'has', 'not', 'yet', 'been', 'replaced', 'i', 'am', 'inquiring', 'because', 'infocast', 'would', 'like', 'to', 'have', 'an', 'enron', 'representative', 'speak', 'at', 'an', 'upcoming', 'conference', 'entitled', 'power', 'crisis', 'in', 'the', 'west', 'status', 'it', 'is', 'certainly', 'going', 'to', 'be', 'an', 'exciting', 'conference', 'due', 'to', 'all', 'of', 'the', 'controversy', 'surrounding', 'the', 'situation', 'in', 'san', 'diego', 'kind', 'regards', 'nia', 'mansell', 'infocast', 'conference', 'manager', '818', '888', '4445', 'ext', '45', '818', '888', '4440', 'fax', 'niam', 'com', 'see', 'attached', 'file', 'power', 'crisis', 'in', 'the', 'west', 'invite', 'doc', '']
Label: 0
Tokenized text: ['subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'fyi', 'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '18', '2001', '03', '07', 'pm', 'angela', 'dorsey', 'on', '01', '18', '2001', '02', '53', '59', 'pm', 'to', 'cc', 'subject', 're', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'karen', 'thanks', 'for', 'the', 'extra', 'support', 'in', 'getting', 'the', 'word', 'out', 'i', 've', 'had', 'a', 'couple', 'rsvp', 's', 'from', 'enron', 'sincerely', 'angela', 'original', 'message', 'from', 'karen', 'marshall', 'enron', 'com', 'mailto', 'karen', 'marshall', 'enron', 'com', 'sent', 'wednesday', 'january', '17', '2001', '7', '59', 'pm', 'to', 'david', 'haug', 'enron', 'com', 'gary', 'hickerson', 'enron', 'com', 'cchilde', 'enron', 'com', 'thomas', 'suffield', 'enron', 'com', 'ben', 'f', 'glisan', 'enron', 'com', 'ermes', 'melinchon', 'enron', 'com', 'hal', 'elrod', 'enron', 'com', 'clay', 'spears', 'enron', 'com', 'kelly', 'mahmoud', 'enron', 'com', 'ellen', 'fowler', 'enron', 'com', 'kevin', 'kuykendall', 'enron', 'com', 'fred', 'mitro', 'enron', 'com', 'kyle', 'kettler', 'enron', 'com', 'jeff', 'bartlett', 'enron', 'com', 'paul', 'j', 'broderick', 'enron', 'com', 'john', 'house', 'enron', 'com', 'george', 'mccormick', 'enron', 'com', 'guido', 'caranti', 'enron', 'com', 'ken', 'sissingh', 'enron', 'com', 'gwynn', 'gorsuch', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'shawn', 'cumberland', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'paul', 'burkhart', 'enron', 'com', 'garrett', 
'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'brian', 'kerrigan', 'enron', 'com', 'mark', 'gandy', 'enron', 'com', 'jennifer', 'martinez', 'enron', 'com', 'sean', 'keenan', 'enron', 'com', 'webb', 'jennings', 'enron', 'com', 'brian', 'hendon', 'enron', 'com', 'billy', 'braddock', 'enron', 'com', 'garrett', 'tripp', 'enron', 'com', 'john', 'massey', 'enron', 'com', 'v', 'charles', 'weldon', 'enron', 'com', 'phayes', 'enron', 'com', 'ross', 'mesquita', 'enron', 'com', 'david', 'mitchell', 'enron', 'com', 'christie', 'patrick', 'enron', 'com', 'michael', 'b', 'rosen', 'enron', 'com', 'cindy', 'derecskey', 'enron', 'com', 'cc', 'elyse', 'kalmans', 'enron', 'com', 'richard', 'causey', 'enron', 'com', 'sally', 'beck', 'enron', 'com', 'vince', 'j', 'kaminski', 'enron', 'com', 'jeffrey', 'a', 'shankman', 'enron', 'com', 'angela', 'dorsey', 'subject', 'urgent', 'deadline', 'rsvp', 'by', 'jan', '22', 'nd', 'invitation', 'to', '2001', 'energy', 'financeconference', 'feb', '22', '23', '2001', 'the', 'university', 'of', 'texas', 'at', 'austin', 'the', '500', 'registration', 'fee', 'is', 'waived', 'for', 'any', 'enron', 'employee', 'who', 'wishes', 'to', 'attend', 'this', 'conference', 'because', 'of', 'our', 'relationship', 'with', 'the', 'school', 'please', 'forward', 'this', 'information', 'to', 'your', 'managers', 'and', 'staff', 'members', 'who', 'would', 'benefit', 'from', 'participating', 'in', 'this', 'important', 'conference', 'note', 'vince', 'kaminski', 'is', 'a', 'panellist', 'for', 'the', 'risk', 'management', 'session', '3', 'please', 'note', 'the', 'deadline', 'for', 'rsvp', 'hotel', 'reservations', 'is', 'monday', 'january', '22', 'nd', 'don', 't', 'miss', 'this', 'opportunity', 'should', 'you', 'have', 'any', 'questions', 'please', 'feel', 'free', 'to', 'contact', 'me', 'at', 'ext', '37632', 'karen', 
'forwarded', 'by', 'karen', 'marshall', 'hou', 'ect', 'on', '01', '11', '2001', '07', '38', 'pm', 'angela', 'dorsey', 'on', '01', '10', '2001', '03', '06', '18', 'pm', 'to', 'angela', 'dorsey', 'cc', 'ehud', 'ronn', 'sheridan', 'titman', 'e', 'mail', 'subject', 'invitation', 'to', '2001', 'energy', 'finance', 'conference', 'the', 'university', 'of', 'texas', 'at', 'austin', 'colleagues', 'and', 'friends', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'cefer', 'happy', 'new', 'year', 'hope', 'you', 'all', 'had', 'a', 'wonderful', 'holiday', 'season', 'on', 'behalf', 'of', 'the', 'university', 'of', 'texas', 'finance', 'department', 'and', 'cefer', 'we', 'would', 'like', 'to', 'cordially', 'invite', 'you', 'to', 'attend', 'our', '2001', 'energy', 'finance', 'conference', 'austin', 'texas', 'february', '22', '23', '2001', 'hosted', 'by', 'the', 'university', 'of', 'texas', 'finance', 'department', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'dr', 'ehud', 'i', 'ronn', 'and', 'dr', 'sheridan', 'titman', 'are', 'currently', 'in', 'the', 'process', 'of', 'finalizing', 'the', 'details', 'of', 'the', 'conference', 'agenda', 'we', 'have', 'listed', 'the', 'agenda', 'outline', 'below', 'to', 'assist', 'you', 'in', 'your', 'travel', 'planning', 'each', 'conference', 'session', 'will', 'be', 'composed', 'of', 'a', 'panel', 'discussion', 'between', '3', '4', 'guest', 'speakers', 'on', 'the', 'designated', 'topic', 'as', 'supporters', 'of', 'the', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'representatives', 'of', 'our', 'trustee', 'corporations', 'enron', 'el', 'paso', 'reliant', 'conoco', 'and', 'southern', 'will', 'have', 'the', '500', 'conference', 'fee', 'waived', 'the', 'conference', 'package', 'includes', 'thursday', 'evening', 's', 'cocktails', 'dinner', 'and', 'hotel', 'ut', 'shuttle', 'service', 'as', 'well', 'as', 'friday', 's', 'conference', 'meals', 'session', 'materials', 
'and', 'shuttle', 'service', 'travel', 'to', 'austin', 'and', 'hotel', 'reservations', 'are', 'each', 'participant', 's', 'responsibility', 'a', 'limited', 'number', 'of', 'hotel', 'rooms', 'are', 'being', 'tentatively', 'held', 'at', 'the', 'radisson', 'hotel', 'on', 'town', 'lake', 'under', 'the', 'group', 'name', 'university', 'of', 'texas', 'finance', 'department', 'for', 'the', 'nights', 'of', 'thursday', '2', '22', '01', 'and', 'friday', '2', '23', '01', 'the', 'latter', 'evening', 'for', 'those', 'who', 'choose', 'to', 'stay', 'in', 'austin', 'after', 'the', 'conference', 's', 'conclusion', 'to', 'guarantee', 'room', 'reservations', 'you', 'will', 'need', 'to', 'contact', 'the', 'radisson', 'hotel', 'at', '512', '478', '9611', 'no', 'later', 'than', 'monday', 'january', '22', 'nd', 'and', 'make', 'your', 'reservations', 'with', 'a', 'credit', 'card', 'please', 'let', 'me', 'know', 'when', 'you', 'have', 'made', 'those', 'arrangements', 'so', 'that', 'i', 'can', 'make', 'sure', 'the', 'radisson', 'gives', 'you', 'the', 'special', 'room', 'rate', 'of', '129', 'night', 'please', 'rsvp', 'your', 'interest', 'in', 'attending', 'this', 'conference', 'no', 'later', 'than', 'january', '22', 'nd', 'to', 'angela', 'dorsey', 'bus', 'utexas', 'edu', 'or', '512', '232', '7386', 'as', 'seating', 'availability', 'is', 'limited', 'please', 'feel', 'free', 'to', 'extend', 'this', 'invitation', 'to', 'your', 'colleagues', 'who', 'might', 'be', 'interested', 'in', 'attending', 'this', 'conference', 'center', 'for', 'energy', 'finance', 'education', 'and', 'research', 'program', 'of', 'the', '2001', 'energy', 'finance', 'conference', 'february', '22', '23', '2001', 'thursday', 'feb', '22', '3', '00', 'p', 'm', 'reserved', 'rooms', 'at', 'the', 'radisson', 'hotel', 'available', 'for', 'check', 'in', '5', '30', 'p', 'm', 'bus', 'will', 'pick', 'up', 'guests', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', 'club', '6', '00', 'p', 'm', 'cocktails', 'ut', 'club', '9', 
'th', 'floor', '7', '00', 'p', 'm', 'dinner', 'ut', 'club', '8', '00', 'p', 'm', 'keynote', 'speaker', '9', '00', 'p', 'm', 'bus', 'will', 'transport', 'guests', 'back', 'to', 'hotel', 'friday', 'feb', '23', '7', '45', 'a', 'm', 'bus', 'will', 'pick', 'up', 'at', 'the', 'radisson', 'for', 'transport', 'to', 'ut', '8', '30', 'a', 'm', 'session', '1', 'real', 'options', 'panelists', 'jim', 'dyer', 'ut', 'chair', 'sheridan', 'titman', 'ut', 'john', 'mccormack', 'stern', 'stewart', 'co', '10', '00', 'a', 'm', 'coffee', 'break', '10', '15', 'a', 'm', 'session', '2', 'deregulation', 'panelists', 'david', 'eaton', 'ut', 'chair', 'david', 'spence', 'ut', 'jeff', 'sandefer', 'sandefer', 'capital', 'partners', 'ut', 'peter', 'nance', 'teknecon', 'energy', 'risk', 'advisors', '11', '45', 'a', 'm', 'catered', 'lunch', 'keynote', 'speaker', '1', '30', 'p', 'm', 'guest', 'tour', 'eds', 'financial', 'trading', 'technology', 'center', '2', '00', 'p', 'm', 'session', '3', 'risk', 'management', 'panelists', 'keith', 'brown', 'ut', 'chair', 'vince', 'kaminski', 'enron', 'alexander', 'eydeland', 'southern', 'co', 'ehud', 'i', 'ronn', 'ut', '3', '30', 'p', 'm', 'snack', 'break', '3', '45', 'p', 'm', 'session', '4', 'globalization', 'of', 'the', 'energy', 'business', 'panelists', 'laura', 'starks', 'ut', 'chair', 'bob', 'goldman', 'conoco', 'ray', 'hill', 'southern', 'co', '5', '15', 'p', 'm', 'wrap', 'up', '5', '30', 'p', 'm', 'bus', 'picks', 'up', 'for', 'transport', 'to', 'airport', 'dinner', '6', '30', 'p', 'm', 'working', 'dinner', 'for', 'senior', 'officers', 'of', 'energy', 'finance', 'center', 'trustees', 'we', 'have', 'made', 'arrangements', 'to', 'provide', 'shuttle', 'service', 'between', 'the', 'radisson', 'hotel', 'and', 'ut', 'during', 'the', 'conference', 'however', 'if', 'you', 'choose', 'to', 'stay', 'at', 'an', 'alternative', 'hotel', 'then', 'transportation', 'to', 'conference', 'events', 'will', 'become', 'your', 'responsibility', 'angela', 'dorsey', 'assistant', 
'director', 'center', 'for', 'energy', 'finance', 'education', 'research', 'the', 'university', 'of', 'texas', 'at', 'austin', 'department', 'of', 'finance', 'cba', '6', '222', 'austin', 'tx', '78712', 'angela', 'dorsey', 'bus', 'utexas', 'edu', '']
Label: 0
Tokenized text: ['subject', 'interview', 'jaesoo', 'lew', '10', '25', '00', 'attached', 'please', 'find', 'the', 'resume', 'interview', 'schedule', 'and', 'evaluation', 'form', 'for', 'jaesoo', 'lew', 'jaesoo', 'will', 'be', 'interviewing', 'with', 'vince', 'kaminski', 's', 'group', 'on', 'an', 'exploratory', 'basis', 'on', 'october', '25', '2000', 'please', 'contact', 'me', 'with', 'any', 'comments', 'or', 'concerns', 'thank', 'you', 'cheryl', 'arguijo', 'ena', 'recruiting', '713', '345', '4016']
Label: 0
Tokenized text: ['subject', 'custom', 'marketing', 'to', 'webmaster', 'ezmlm', 'org', 'email', 'is', 'the', 'best', 'promote', 'tool', 'we', 'offer', 'online', 'marketing', 'with', 'quality', 'service', '1', 'target', 'email', 'list', 'we', 'can', 'provide', 'target', 'email', 'list', 'you', 'need', 'which', 'are', 'compiled', 'only', 'on', 'your', 'order', 'we', 'will', 'customize', 'your', 'client', 'email', 'list', 'we', 'have', 'millions', 'of', 'lists', 'in', 'a', 'wide', 'variety', 'of', 'categories', '2', 'send', 'out', 'target', 'list', 'for', 'you', 'we', 'can', 'send', 'your', 'email', 'message', 'to', 'your', 'target', 'clients', 'we', 'will', 'customize', 'your', 'email', 'list', 'and', 'send', 'your', 'message', 'for', 'you', 'our', 'site', 'www', 'marketingforus', 'com', 'we', 'also', 'offer', 'web', 'hosting', 'mail', 'server', 'regards', 'jason', 'marketing', 'support', 'sales', 'marketingforus', 'com', 'no', 'thanks', 'byebye', 'msn', 'com', 'subject', 'webmaster', 'ezmlm', 'org']
Label: 1
Tokenized text: ['subject', 'popular', 'software', 'at', 'low', 'low', 'prices', 'alaina', 'windows', 'xp', 'professional', '2002', '50', 'adobe', 'photoshop', '7', '0', '50', 'microsoft', 'office', 'xp', 'professional', '2002', '50', 'corel', 'draw', 'graphics', 'suite', '11', '50', '']
Label: 1
%% Cell type:markdown id: tags:
## TASK 1: Implement `fit`
Implement the `fit` and `predict` methods for Naive Bayes. Use the $m$-estimate to handle words that never occur in a class (with $m = 1$ this is also called **Laplace smoothing**). In general, $m$ values should be small; we'll use $m$ = 1. We'll use log probabilities to avoid underflow.
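As a toy illustration of the smoothing formula (made-up counts, not the assignment solution), the $m$-estimate replaces the maximum-likelihood word probability $\frac{\text{count}(w, c)}{\text{total}(c)}$ with $\frac{\text{count}(w, c) + m}{\text{total}(c) + m\,|V|}$, where $|V|$ is the vocabulary size:

``` python
import numpy as np

# Toy counts: the word "offer" appears 3 times among 100 spam words
# and 0 times among 120 ham words; the vocabulary has 50 distinct words.
m = 1
vocab_size = 50

# Unsmoothed, P("offer" | ham) = 0/120 = 0, so log(0) = -inf would wipe out
# the posterior of any email containing "offer". The m-estimate avoids this:
p_spam = (3 + m) / (100 + m * vocab_size)   # 4/150
p_ham = (0 + m) / (120 + m * vocab_size)    # 1/170 instead of 0

print(np.log(p_spam), np.log(p_ham))
```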
%% Cell type:code id: tags:
``` python
class Naive_Bayes(Model):

    def __init__(self, m):
        '''
        Args:
            m: int
                Specifies the smoothing parameter
        '''
        self.m = m

    def fit(self, X, Y):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
            Y: ndarray
                1D array of true labels. 1: spam, 0: ham
        '''
        #TODO
        # Replace Nones, empty lists and dictionaries below

        # List containing all distinct words in all emails.
        # A list might not be the best data structure for obtaining
        # the vocabulary. Use a temporary, more efficient data structure,
        # then populate self.vocabulary.
        self.vocabulary = []

        # find *log* class prior probabilities
        self.prior = {'spam': None, 'ham': None}

        # find the number of words (counting repeats) summed across all emails in a class
        self.total_count = {'spam': None, 'ham': None}

        # find the count of each word summed across all emails in a class
        # populate self.word_counts
        # self.word_counts['spam'] is a dict with words as keys.
        self.word_counts = {'spam': {}, 'ham': {}}

    def predict(self, X):
        '''
        Args:
            X: list
                list of lists where each entry is a tokenized email text
        Returns:
            probs: ndarray
                mx2 array containing unnormalized log posteriors for spam and ham (for grading purposes)
            preds: ndarray
                1D binary array containing predicted labels
        '''
        preds = []
        probs = []
        #TODO
        # use the attributes calculated in fit to compute unnormalized class
        # posterior probabilities and predicted labels
        raise NotImplementedError
        return np.array(probs), np.array(preds)
```
%% Cell type:markdown id: tags:
### Rubric:
* correct vocabulary length +5,+5
* correct log class priors +10, +10
* correct word counts for the 5 most frequent words in each class +10, +10
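The skeleton's comment notes that a plain list is not ideal for collecting the vocabulary. As an illustrative sketch on toy data (one possible approach, not a required one), a set gives constant-time membership checks while accumulating, and can be converted to a list at the end:

``` python
# Toy emails; a set avoids the O(n) duplicate checks a list would need.
emails = [['hello', 'world'], ['hello', 'spam', 'offer']]
vocab_set = set()
for email in emails:
    vocab_set.update(email)
vocabulary = sorted(vocab_set)  # convert to a list, with a stable order
print(vocabulary)  # ['hello', 'offer', 'spam', 'world']
```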
%% Cell type:markdown id: tags:
### Test `fit`
%% Cell type:code id: tags:
``` python
# initialize the model
my_model = Naive_Bayes(...)
# pass the training emails and labels
my_model.fit(X=..., Y=...)
# display the 5 most frequent words in each class
for cl in ['ham', 'spam']:
    srt = sorted(my_model.word_counts[cl].items(), key=lambda x: x[1], reverse=True)
    print('\n{} log prior: {}'.format(cl, my_model.prior[cl]))
    print('5 most frequent words:')
    print(srt[:5])
print('\nVocabulary has {} distinct words'.format(len(my_model.vocabulary)))
```
%% Cell type:markdown id: tags:
## TASK 2: Implement `predict`
Print the unnormalized log posteriors for the first five test examples. We can use the "conf_matrix" function to see how the error is distributed.
### Rubric:
* Correct unnormalized log posteriors +20, +20
* Correct confusion matrix +5, +5
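As a toy sketch of the quantity being computed (with made-up log likelihoods, not the assignment's values), an unnormalized log posterior is the log class prior plus the sum of smoothed log likelihoods of the email's words, and the prediction is the class with the larger value:

``` python
import numpy as np

log_prior = {'spam': np.log(0.5), 'ham': np.log(0.5)}
# Hypothetical smoothed log likelihoods for two words
log_lik = {'spam': {'free': np.log(0.04), 'offer': np.log(0.03)},
           'ham':  {'free': np.log(0.01), 'offer': np.log(0.005)}}

email = ['free', 'offer']
# log P(c) + sum over words of log P(w | c), left unnormalized
posterior = {c: log_prior[c] + sum(log_lik[c][w] for w in email)
             for c in ['spam', 'ham']}
pred = 1 if posterior['spam'] > posterior['ham'] else 0
print(posterior, pred)
```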
### Test `predict`
%% Cell type:code id: tags:
``` python
probs, preds = my_model.predict(X=...)
print('\nUnnormalized log posteriors of first 5 test examples:')
print(probs[:5])
tp, tn, fp, fn = conf_matrix(true=..., pred=preds)
print('tp: {}, tn: {}, fp: {}, fn: {}'.format(tp, tn, fp, fn))
```