diff --git a/ProgrammingAssignment1.ipynb b/ProgrammingAssignment1.ipynb index 05ab703fa24082c8e98a026fff8a749803307a6f..bbf3a1bf8b41574ed85277b8abeb4a2594d50762 100644 --- a/ProgrammingAssignment1.ipynb +++ b/ProgrammingAssignment1.ipynb @@ -276,7 +276,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.)" + "We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.) We'll plot the ROC curve for values between 0.1 and 1.0." ] }, { diff --git a/ProgrammingAssignment2.ipynb b/ProgrammingAssignment2.ipynb index 41badf7aafb60b063f53955f8e76c2e87c35f603..b92ee64c0185b3f5f3da3fd2c7d984f4e2581d72 100644 --- a/ProgrammingAssignment2.ipynb +++ b/ProgrammingAssignment2.ipynb @@ -6,7 +6,7 @@ "source": [ "# Linear Regression & Naive Bayes\n", "\n", - "We'll implement linear regression & Naive Bayes algorithms for this assignment. Please modify the \"preprocess\" and \"partition\" methods in \"model.ipynb\" to suit your datasets for this assignment. In this assignment, we have a small dataset available to us. We won't have examples to spare for validation set, instead we'll use cross-validation to tune hyperparameters.\n", + "We'll implement linear regression & Naive Bayes algorithms for this assignment. Please modify the \"preprocess\" function in this notebook and the \"partition\" function in \"model.ipynb\" to suit your datasets for this assignment. In the linear regression part of this assignment, we have a small dataset available to us. We won't have examples to spare for a validation set, so we'll use cross-validation to tune hyperparameters. 
In our Naive Bayes implementation, we will not use a validation set or cross-validation.\n", "\n", "### Assignment Goals:\n", "In this assignment, we will:\n", @@ -31,7 +31,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -48,52 +48,80 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run 'model.ipynb'" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll implement the \"preprocess\" function in this notebook and the \"kfold\" function for $k$-fold cross-validation in \"model.ipynb\". 5 and 10 are commonly used values for $k$. You can use either one of them." + ] + }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ - "def mse(y_pred, y_true):\n", + "def preprocess(file_path):\n", " '''\n", - " y_hat: values predicted by our method\n", - " y_true: true y values\n", + " file_path: where to read the dataset from\n", + " Returns:\n", + " features: ndarray\n", + " nxd array containing `float` feature values\n", + " labels: ndarray\n", + " 1D array containing `float` labels\n", " '''\n", + " # You might find np.genfromtxt useful for reading in the file. Be careful with the file delimiter, \n", + " # e.g. for comma-separated files use the delimiter=',' argument.\n", + " \n", " raise NotImplementedError\n", - " # returns mean squared error between y_pred and y_true\n", - " return cost\n", - " " + "\n", + " \n", + " return features, labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We'll start by implementing a partition function for $k$-fold cross-validation. $5$ and $10$ are commonly used values for $k$. You can use either one of them." + "We'll need to use mean squared error (mse) for linear regression. Next, implement the \"mse\" function that takes predicted and true y values, and returns the mean squared error between them." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ - "def kfold(k):\n", + "def mse(y_pred, y_true):\n", " '''\n", - " k: number of desired splits in data.\n", - " Assume test set is already separated.\n", - " This function chooses 1/k of training indices randomly and separates them as validation partition.\n", - " It returns the 1/k selected indices as val_indices and the remaining 1-(1/k) of the indices as train_indices\n", + " Args:\n", + " y_pred: ndarray \n", + " 1D array containing data with `float` type. Values predicted by our method\n", + " y_true: ndarray\n", + " 1D array containing data with `float` type. True y values\n", + " Returns:\n", + " cost: float\n", + " A single value. Mean squared error between y_pred and y_true.\n", + " \n", " '''\n", - " \n", - " return train_indices, val_indices" + " raise NotImplementedError\n", + "\n", + " return cost\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can define our linear_regression model class now. Implement the \"fit\" and \"predict\" methods. Keep the default values for now; later we'll change the $polynomial\\_degree$. 
If your \"kfold\" implementation works as it should, each call to \"fit\" and \"predict\" will move on to the next fold." ] }, { @@ -105,27 +133,53 @@ "class linear_regression(Model):\n", " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", " super().__init__(preprocessor_f, partition_f, **kwargs)\n", - " theta = \n", + " if kwargs.get('k_fold'):\n", + " self.data_dict = kfold(self.train_indices, k = kwargs['k'])\n", + " # counter for train fold\n", + " self.i = 0\n", + " # counter for test fold\n", + " self.j = 0 \n", + " \n", " # You can disregard polynomial_degree and regularizer in your first pass\n", " def fit(self, learning_rate = 0.001, epochs = 1000, regularizer=None, polynomial_degree=1, **kwargs):\n", " \n", + " train_features = self.train_features[self.data_dict[self.i][0]]\n", + " train_labels = self.train_labels[self.data_dict[self.i][0]]\n", + " \n", + " # initialize theta_curr randomly\n", + " \n", " # for each epoch\n", - " # compute y_hat array which holds model predictions for training examples\n", + " # compute model predictions for training examples\n", " y_hat = None\n", - " # use mse function to find the cost\n", - " cost = None\n", - " # calculate gradients wrt theta\n", - " grad_theta = None\n", - " # update theta\n", - " theta_curr = None\n", - " raise NotImplementedError\n", " \n", - " return theta\n", - " \n", + " if regularizer is None:\n", + " \n", + " # use mse function to find the cost\n", + " cost = None\n", + " # calculate gradients wrt theta\n", + " grad_theta = None\n", + " # update theta\n", + " theta_curr = None\n", + " raise NotImplementedError\n", + " \n", + " else:\n", + " # take regularization into account\n", + " raise NotImplementedError\n", + " \n", + " # update the model parameters to be used in predict method\n", + " self.theta = theta_curr\n", + " # increment counter for next fold\n", + " self.i += 1\n", + " \n", " def predict(self, indices):\n", " \n", + " # obtain test features for current fold\n", + " \n", + " test_features = 
self.train_features[self.data_dict[self.j][1]]\n", " raise NotImplementedError\n", " \n", + " # increment counter for next fold\n", + " self.j += 1\n", " return y_hat\n", " " ] @@ -137,7 +191,9 @@ "outputs": [], "source": [ "# populate the keyword arguments dictionary kwargs\n", - "kwargs = {'p': 0.3, 'v': 0.0, 'file_path': 'mnist_test.csv', 'k': 5}\n", + "# p: proportion for test data\n", + "# k: parameter for k-fold cross-validation\n", + "kwargs = {'p': 0.3, 'v': 0.1, 'file_path': 'madelon', 'k': 1}\n", "# initialize the model\n", "my_model = linear_regression(preprocessor_f=preprocess, partition_f=partition, k_fold=True, **kwargs)" ] @@ -149,6 +205,8 @@ "outputs": [], "source": [ "# use fit_kwargs to pass arguments to regularization function\n", + "# fit_kwargs is empty for now since we are not applying \n", + "# regularization yet\n", "fit_kwargs = {}\n", "my_model.fit(**fit_kwargs)" ] @@ -157,7 +215,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Residuals are the differences between the predicted value $y_{hat}$ and the true value $y$ for each example. Predict $y_{hat}$ for the validation set. Calculate and plot residuals. " + "Residuals are the differences between the predicted value $y_{hat}$ and the true value $y$ for each example. Predict $y_{hat}$ for the validation set." ] }, { @@ -176,7 +234,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If the data is better suited for quadratic/cubic regression, regions of positive and negative residuals will alternate in the plot. Regardless, modify fit\" and \"predict\" in the class definition to raise the feature values to $polynomial\\_degree$. You can directly make the modification in the above definition, do not repeat. Use the validation set to find the degree of polynomial that results in lowest \"mse\"." + "If the data is better suited for quadratic/cubic regression, regions of positive and negative residuals will alternate in the plot. 
Regardless, modify \"fit\" and \"predict\" in the class definition to raise the feature values to $polynomial\\_degree$. You can make the modification directly in the above definition; do not repeat it. Use the validation set to find the degree of polynomial that results in the lowest _mse_." ] }, { @@ -185,21 +243,24 @@ "metadata": {}, "outputs": [], "source": [ - "# calculate mse for linear model\n", + "kwargs = {'p': 0.3, 'file_path': 'madelon', 'k': 5}\n", + "# initialize the model\n", + "my_model = linear_regression(preprocessor_f=preprocess, partition_f=partition, k_fold=True, **kwargs)\n", + "\n", "fit_kwargs = {}\n", - "my_model.fit(polynomial_degree = 1 ,**fit_kwargs)\n", - "pred_3 = my_model.predict(my_model.features[my_model.val_indices])\n", - "mse_1 = mse(pred_2, my_model.labels[my_model.val_indices])\n", "\n", - "# calculate mse for quadratic model\n", - "my_model.fit(polynomial_degree = 2 ,**fit_kwargs)\n", - "pred_2 = my_model.predict(my_model.features[my_model.val_indices])\n", - "mse_2 = mse(pred_2, my_model.labels[my_model.val_indices])\n", + "# calculate mse for each of the linear, quadratic and cubic models\n", + "# and append to mses_for_models\n", + "\n", + "mses_for_models = []\n", "\n", - "# calculate mse for cubic model\n", - "my_model.fit(polynomial_degree = 3 ,**fit_kwargs)\n", - "pred_3 = my_model.predict(my_model.features[my_model.val_indices])\n", - "mse_3 = mse(pred_2, my_model.labels[my_model.val_indices])" + "for i in range(1,4):\n", + " kfold_mse = 0\n", + " for k in range(5):\n", + " my_model.fit(polynomial_degree = i ,**fit_kwargs)\n", + " pred = my_model.predict(my_model.features[my_model.val_indices])\n", + " kfold_mse += mse(pred, my_model.labels[my_model.val_indices])\n", + " mses_for_models.append(kfold_mse / kwargs['k'])" ] }, { @@ -215,18 +276,154 @@ "metadata": {}, "outputs": [], "source": [ - "def regularization(method):\n", + "def regularization(weights, method):\n", + " '''\n", + " Args:\n", + " weights: 
ndarray\n", + " 1D array with `float` entries\n", + " method: str\n", + " Returns:\n", + " value: float\n", + " A single value. Regularization term that will be used in the cost function in fit.\n", + " '''\n", " if method == \"l1\":\n", + " value = None\n", " raise NotImplementedError\n", " elif method == \"l2\":\n", - " raise NotImplementedError" + " value = None\n", + " raise NotImplementedError\n", + " return value" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using cross-validation and the value of $polynomial\\_degree$ you found above, try different values of $\\lambda$ to find a good value that results in low _mse_. Report the best values you found for the hyperparameters and the resulting _mse_." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Naive Bayes Spam Classifier\n", + "\n", + "This part is independent of the above part. We will use the Enron spam/ham dataset. You will need to decompress the provided \"enron.tar.gz\" archive. The two subfolders contain spam and ham emails.\n", + "\n", + "The features for the Naive Bayes algorithm will be word counts. The number of features will be equal to the number of unique words seen in the whole dataset. The \"preprocess\" function will be more involved this time. You'll need to remove punctuation marks (you may find string.punctuation useful), tokenize the text into words (remember to lowercase them all) and count the occurrences of each word." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def preprocess_bayes(folder_path):\n", + " '''\n", + " Args:\n", + " folder_path: str\n", + " Where to read the dataset from.\n", + " Returns:\n", + " features: ndarray\n", + " nxd array with n emails, d words. 
features_ij is the count of word_j in email_i\n", + " labels: ndarray\n", + " 1D array of labels (1: spam, 0: ham)\n", + " '''\n", + " # remove punctuation marks\n", + " # tokenize, lowercase\n", + " # count occurrences of each word in each email\n", + " \n", + " raise NotImplementedError\n", + "\n", + " \n", + " return features, labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Using the validation set, and the value of $polynomial_{degree}$ you found above, try different values of $\\lambda$ to find a a good value that results in low $mse$. You can " + "Implement the \"fit\" and \"predict\" methods for Naive Bayes. Use the $m$-estimate to address missing attribute values (also called **Laplace smoothing** when $m$ = 1). In general, $m$ values should be small. We'll use $m$ = 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class naive_bayes(Model):\n", + " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", + " super().__init__(preprocessor_f, partition_f, **kwargs)\n", + " \n", + " def fit(self, m, **kwargs):\n", + " \n", + " self.ham_word_counts = np.zeros(self.feat_dim)\n", + " self.spam_word_counts = np.zeros(self.feat_dim)\n", + " \n", + " # find class prior probabilities\n", + " self.ham_prior = None\n", + " self.spam_prior = None\n", + " # find the number of words (counting repeats) summed across all emails in a class\n", + " n = None\n", + " # find the number of each word summed across all emails in a class\n", + " # populate self.ham_word_counts and self.spam_word_counts\n", + " \n", + " # find the likelihood of a word_i in each class\n", + " # 1D ndarray\n", + " self.ham_likelihood = None\n", + " self.spam_likelihood = None\n", + " \n", + " \n", + " def predict(self, indices):\n", + " '''\n", + " Returns:\n", + " preds: ndarray\n", + " 1D binary array containing predicted labels\n", + " '''\n", + " raise NotImplementedError\n", + " \n", + " return preds\n", + " " + ] 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can fit our model and see how accurately it predicts spam emails now. We won't use a validation set or cross-validation this time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# populate the keyword arguments dictionary kwargs\n", + "# p: proportion for test data\n", + "# file_path: where to read the dataset from\n", + "kwargs = {'p': 0.3, 'file_path': 'enron'}\n", + "# initialize the model\n", + "my_model = naive_bayes(preprocessor_f=preprocess_bayes, partition_f=partition, **kwargs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the \"conf_matrix\" function we defined before to see how the error is distributed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preds = my_model.predict(my_model.test_indices)\n", + "tp, tn, fp, fn = conf_matrix(true = my_model.labels[my_model.test_indices], pred = preds)" ] } ], diff --git a/model.ipynb b/model.ipynb index 94618fe6006d89f2379d0484c4ea64eb82020509..03b6a2f9dbc97f0ebae63efb4ce7b8959cf29cd1 100644 --- a/model.ipynb +++ b/model.ipynb @@ -57,11 +57,11 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ - "def partition(size, p, v):\n", + "def partition(size, p, v = 0):\n", " '''\n", " size: number of examples in the whole dataset\n", " p: proportion kept for test\n", @@ -80,6 +80,47 @@ " return val_indices, test_indices" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In cases where data is not abundantly available, we estimate the error by averaging the error over different splits of the data. In this case, every fold of data is used for testing and for training in turn, i.e. 
assuming we split our data into 3 folds, we'd\n", + "* train our model on fold-1+fold-2 and test on fold-3\n", + "* train our model on fold-1+fold-3 and test on fold-2\n", + "* train our model on fold-2+fold-3 and test on fold-1.\n", + "\n", + "We'd use the average of the errors we obtained in the three runs as our error estimate. Implement the \"kfold\" function below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def kfold(indices, k):\n", + " '''\n", + " Args:\n", + " indices: ndarray\n", + " 1D array with integer entries containing indices\n", + " k: int \n", + " Number of desired splits in data. (Assume test set is already separated.)\n", + " Returns:\n", + " fold_dict: dict\n", + " A dictionary with integer keys corresponding to folds. Values are (train_indices, val_indices).\n", + " \n", + " val_indices: ndarray\n", + " 1/k of the training indices, randomly chosen and separated as the validation partition.\n", + " train_indices: ndarray\n", + " Remaining 1-(1/k) of the indices.\n", + " \n", + " e.g. fold_dict = {0: (train_0_indices, val_0_indices), \n", + " 1: (train_1_indices, val_1_indices), 2: (train_2_indices, val_2_indices)} for k = 3\n", + " '''\n", + " \n", + " return fold_dict" + ] + }, { "cell_type": "code", "execution_count": 1, @@ -106,9 +147,10 @@ " self.train_size = len(self.train_indices)\n", " \n", " def fit(self):\n", + " \n", " raise NotImplementedError\n", " \n", - " def predict(self, testpoint):\n", + " def predict(self, indices):\n", " raise NotImplementedError" ] },
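For reference, the fold structure described in the "kfold" docstring could be produced along the following lines. This is only a minimal sketch, not the assignment's reference solution: the name `kfold_sketch` and the `seed` parameter are assumptions introduced here for illustration, and it assumes NumPy is available and that `k` is at least 2.

```python
import numpy as np


def kfold_sketch(indices, k, seed=0):
    """Hypothetical helper: return {fold: (train_indices, val_indices)}.

    Assumes `indices` is a 1D integer array and k >= 2. The `seed`
    argument is an illustrative addition to make the shuffle repeatable.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that every fold has training data")

    # shuffle once so that each fold is a random 1/k slice of the indices
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(indices))

    # array_split tolerates lengths that are not divisible by k
    folds = np.array_split(shuffled, k)

    fold_dict = {}
    for i in range(k):
        val_indices = folds[i]
        # every fold except the i-th one is used for training
        train_indices = np.concatenate([folds[j] for j in range(k) if j != i])
        fold_dict[i] = (train_indices, val_indices)
    return fold_dict
```

Each index appears in exactly one validation partition, so averaging the per-fold errors uses every example for validation once.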