diff --git a/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb b/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb index ef2dfceeaa732e83a318dd2a6f415cd755673e1b..c731ed8fe8368e101057b632a72d41638a19d1de 100644 --- a/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb +++ b/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb @@ -11,13 +11,13 @@ "* number of features (attributes) should be at least 50\n", "* number of examples (instances) should be between 1,000 - 5,000\n", "\n", - "A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods. \n", + "A skeleton of a general supervised learning model is provided in \"model.ipynb\". The functions that will be implemented there will be indicated in this notebook. \n", "\n", "### Assignment Goals:\n", "In this assignment, we will:\n", - "* learn to split a dataset into training/validation/test partitions \n", + "* we'll implement 'Euclidean' and 'Manhattan' distance metrics \n", "* use the validation dataset to find a good value for *k*\n", - "* Having found the \"best\" *k*, we'll obtain final performance measures:\n", + "* Evaluate our model with respect to performance measures:\n", " * accuracy, generalization error and ROC curve\n" ] }, @@ -45,8 +45,13 @@ "|---|----------------|-----|-----|\n", "| 5 | Optimizing *k* | 10 | 10 |\n", "\n", + "### Bonus for 478/878\n", "\n", - "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 10 bonus points for 478 students." + "| | Tasks | 478 | 878 |\n", + "|---|----------------|-----|-----|\n", + "| 7 | Assess suitability of *k*-NN | 10 | 10 |\n", + "\n", + "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. 
There are a total of 100 points in this assignment, plus 20 bonus points for 478 students and 10 bonus points for 878 students." ] }, { @@ -93,7 +98,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." + "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." ] }, { @@ -102,7 +107,7 @@ "source": [ "### Rubric:\n", "* Euclidean +5, +5\n", - "* Hamming +5, +5" + "* Manhattan +5, +5" ] }, { @@ -121,7 +126,8 @@ "x = np.array(range(100))\n", "y = np.array(range(100, 200))\n", "dist_euclidean = distance(x, y, 'Euclidean')\n", - "dist_hamming = distance(x, y, 'Hamming')" + "dist_manhattan = distance(x, y, 'Manhattan')\n", + "print('Euclidean distance: {}, Manhattan distance: {}'.format(dist_euclidean, dist_manhattan))" ] }, { @@ -135,7 +141,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." + "We can start implementing our *k*-NN classifier. The *k*-NN class inherits from the Model class. Use the \"distance\" function you defined above. The \"fit\" method takes the training features and labels, *k*, and a distance function as arguments. 
\"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." ] }, { @@ -158,25 +164,39 @@ " Inherits Model class. Implements the k-NN algorithm for classification.\n", " '''\n", " \n", - " def fit(self, k, distance_f, **kwargs):\n", + " def fit(self, training_features, training_labels, k, distance_f,**kwargs):\n", " '''\n", " Fit the model. This is pretty straightforward for k-NN.\n", + " Args:\n", + " training_features: ndarray\n", + " training_labels: ndarray\n", + " k: int\n", + " distance_f: function\n", + " kwargs: dict\n", + " Contains keyword arguments that will be passed to distance_f\n", " '''\n", " # TODO\n", - " # set self.k, self.distance_f, self.distance_metric\n", - " raise NotImplementedError\n", + " # set self.train_features, self.train_labels, self.k, self.distance_f, self.distance_metric\n", " \n", + " raise NotImplementedError\n", + "\n", " return\n", " \n", " \n", - " def predict(self, test_indices):\n", - " \n", + " def predict(self, test_features):\n", + " '''\n", + " Args:\n", + " test_features: ndarray\n", + " mxd array containing features for the points to be predicted\n", + " Returns: \n", + " ndarray\n", + " '''\n", " raise NotImplementedError\n", " \n", " pred = []\n", " # TODO\n", " \n", - " # for each point in test points\n", + " # for each point in test_features\n", " # use your implementation of distance function\n", " # distance_f(..., distance_metric)\n", " # to find the labels of k-nearest neighbors. 
\n", @@ -184,6 +204,8 @@ " # Find the ratio of the positive labels\n", " # and append to pred with pred.append(ratio).\n", " \n", + " # when calculating learning curve you can make use of\n", + " # self.learning_curve and self.training_proportion\n", "\n", " return np.array(pred)\n", " " @@ -210,7 +232,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Remember you need to provide values to *t*, *v* parameters for \"partition\" function and to *feature_file* and *label_file* for \"preprocess\" function." + "Preprocess the data files and partition the data." ] }, { @@ -219,10 +241,12 @@ "metadata": {}, "outputs": [], "source": [ - "# populate the keyword arguments dictionary kwargs\n", - "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n", "# initialize the model\n", - "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)" + "my_model = kNN()\n", + "# obtain features and labels from files\n", + "features, labels = preprocess(feature_file=..., label_file=...)\n", + "# partition the data set\n", + "val_indices, test_indices, train_indices = partition(size=..., t = 0.3, v = 0.1)" ] }, { @@ -238,15 +262,17 @@ "metadata": {}, "outputs": [], "source": [ + "# pass the training features and labels to the fit method\n", "kwargs_f = {'metric': 'Euclidean'}\n", - "my_model.fit(k = 10, distance_f=distance, **kwargs_f)" + "my_model.fit(training_features=..., training_labels=..., k=10, distance_f=..., **kwargs_f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate." + "### Computing the confusion matrix for *k* = 10\n", + "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. 
Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." ] }, { @@ -255,22 +281,28 @@ "metadata": {}, "outputs": [], "source": [ - "final_labels = my_model.predict(my_model.test_indices)\n", + "# TODO\n", + "\n", + "\n", + "# get model predictions\n", + "pred_ratios = my_model.predict(my_model.features[my_model.test_indices])\n", "\n", "# For now, we will consider a data point as predicted in the positive class if more than 0.5 \n", "# of its k-neighbors are positive.\n", "threshold = 0.5\n", + "# convert predicted ratios to predicted labels\n", + "pred_labels = None\n", "\n", - "# TODO\n", - "# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell." + "# obtain true positive, true negative,\n", + "#false positive and false negative counts using conf_matrix\n", + "tp,tn, fp, fn = conf_matrix(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Computing the confusion matrix for *k* = 10\n", - "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." + "Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate." ] }, { @@ -280,7 +312,10 @@ "outputs": [], "source": [ "# TODO\n", - "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)" + "# Calculate and report accuracy and generalization error with confidence interval here. 
Show your work in this cell.\n", + "\n", + "print('Accuracy: {}'.format(accuracy))\n", + "print('Confidence interval: {}-{}'.format(lower_bound, upper_bound))" ] }, { @@ -309,15 +344,22 @@ "metadata": {}, "outputs": [], "source": [ - "# try sizes 50, 100, 150, 200, ..., up to the largest multiple of 50 >= train_size\n", - "training_sizes = np.arange(50, my_model.train_size + 1, 50)\n", + "# train using 10%, 20%, 30%, ..., 100% of training data\n", + "training_proportions = np.arange(0.10, 1.01, 0.10)\n", + "train_size = len(train_indices)\n", + "training_sizes = np.ceil(train_size * training_proportions).astype(int)\n", "\n", "# TODO\n", + "error_train = []\n", + "error_val = []\n", "\n", - "# Calculate error for each entry in training_sizes\n", - "# for training and validation sets and populate\n", - "# error_train and error_val arrays. Each entry in these arrays\n", - "# should correspond to each entry in training_sizes.\n", + "# For each size in training_sizes\n", + "for size in training_sizes:\n", + " # fit the model using the first \"size\" training data points\n", + " # Calculate error for training and validation sets\n", + " # populate error_train and error_val arrays. \n", + " # Each entry in these arrays\n", + " # should correspond to each entry in training_sizes.\n", "\n", "plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n", "plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n", @@ -393,11 +435,26 @@ "metadata": {}, "outputs": [], "source": [ + "# TODO\n", "# ROC curve\n", "roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))\n", "plt.plot(roc_sens, roc_spec_)\n", "plt.show()" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 7: Assess suitability of *k*-NN to your dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use this cell to write about your understanding of why *k*-NN performed well if it did or why not if it didn't." 
+ ] } ], "metadata": { diff --git a/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb b/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c3535482ded9e01f384f660abad8e0d6d88e74e8 --- /dev/null +++ b/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb @@ -0,0 +1,581 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# *k*-Nearest Neighbor\n", + "\n", + "We'll implement *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n", + "* dependent variable should be binary (suited for binary classification)\n", + "* number of features (attributes) should be at least 50\n", + "* number of examples (instances) should be between 1,000 - 5,000\n", + "\n", + "A skeleton of a general supervised learning model is provided in \"model.ipynb\". The functions that will be implemented there will be indicated in this notebook. \n", + "\n", + "### Assignment Goals:\n", + "In this assignment, we will:\n", + "* we'll implement 'Euclidean' and 'Manhattan' distance metrics \n", + "* use the validation dataset to find a good value for *k*\n", + "* Evaluate our model with respect to performance measures:\n", + " * accuracy, generalization error and ROC curve\n", + "* Try to assess if *k*-NN is suitable for the dataset you used\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# GRADING\n", + "\n", + "You will be graded on parts that are marked with **\\#TODO** comments. 
Read the comments in the code to make sure you don't miss any.\n", + "\n", + "### Mandatory for 478 & 878:\n", + "\n", + "| | Tasks | 478 | 878 |\n", + "|---|----------------------------|-----|-----|\n", + "| 1 | Implement `distance` | 10 | 10 |\n", + "| 2 | Implement `k-NN` methods | 25 | 20 |\n", + "| 3 | Model evaluation | 25 | 20 |\n", + "| 4 | Learning curve | 20 | 20 |\n", + "| 6 | ROC curve analysis | 20 | 20 |\n", + "\n", + "### Mandatory for 878, bonus for 478\n", + "\n", + "| | Tasks | 478 | 878 |\n", + "|---|----------------|-----|-----|\n", + "| 5 | Optimizing *k* | 10 | 10 |\n", + "\n", + "### Bonus for 478/878\n", + "\n", + "| | Tasks | 478 | 878 |\n", + "|---|----------------|-----|-----|\n", + "| 7 | Assess suitability of *k*-NN | 10 | 10 |\n", + "\n", + "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 20 bonus points for 478 students and 10 bonus points for 878 students." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries." + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Following code makes the Model class and relevant functions available from model.ipynb." + ] + }, + { + "cell_type": "code", + "execution_count": 134, + "metadata": {}, + "outputs": [], + "source": [ + "%run 'model_solution.ipynb'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 1: Implement `distance` function" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Choice of distance metric plays an important role in the performance of *k*-NN. 
Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rubric:\n", + "* Euclidean +5, +5\n", + "* Manhattan +5, +5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test `distance`" + ] + }, + { + "cell_type": "code", + "execution_count": 136, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Euclidean distance: 1000.0, Manhattan distance: 10000\n" + ] + } + ], + "source": [ + "x = np.array(range(100))\n", + "y = np.array(range(100, 200))\n", + "dist_euclidean = distance(x, y, 'Euclidean')\n", + "dist_manhattan = distance(x, y, 'Manhattan')\n", + "print('Euclidean distance: {}, Manhattan distance: {}'.format(dist_euclidean, dist_manhattan))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 2: Implement $k$-NN Class Methods" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rubric:\n", + "* correct implementation of fit method +5, +5\n", + "* correct implementation of predict method +20, +15" + ] + }, + { + "cell_type": "code", + "execution_count": 137, + "metadata": {}, + "outputs": [], + "source": [ + "class kNN(Model):\n", + " '''\n", + " Inherits Model class. 
Implements the k-NN algorithm for classification.\n", + " '''\n", + " \n", + " def fit(self, training_features, training_labels, k, distance_f, **kwargs):\n", + " '''\n", + " Fit the model. This is pretty straightforward for k-NN.\n", + " Args:\n", + " training_features: ndarray\n", + " training_labels: ndarray\n", + " k: int\n", + " distance_f: function\n", + " kwargs: dict\n", + " Contains keyword arguments that will be passed to distance_f\n", + " '''\n", + " # TODO\n", + " # set self.train_features, self.train_labels, self.k, self.distance_f, self.distance_metric\n", + " self.train_features = training_features \n", + " self.train_labels = training_labels\n", + " self.k = k\n", + " self.distance_f = distance_f\n", + " self.distance_metric = kwargs['metric']\n", + " \n", + " return\n", + " \n", + " \n", + " def predict(self, test_features):\n", + " \n", + " test_size = len(test_features)\n", + " train_size = len(self.train_labels)\n", + " \n", + " pred = np.empty(len(test_features))\n", + " # TODO\n", + " # for each point in test points\n", + " for idx in range(test_size):\n", + " point = test_features[idx] \n", + " distances = []\n", + " labels = []\n", + " \n", + "\n", + " for tr_idx in range(train_size):\n", + " train_example = self.train_features[tr_idx]\n", + " train_label = self.train_labels[tr_idx]\n", + " dist = self.distance_f(point, train_example, metric = self.distance_metric)\n", + " distances.append(dist)\n", + " labels.append(train_label)\n", + " \n", + " # get the indices that sort the distances in ascending order\n", + " dist_order = np.argsort(distances)\n", + " # get the labels of k points that are closest to test point\n", + " k_labels = list(np.array(labels)[dist_order[:self.k]])\n", + " \n", + " # get number of positive labels in k neighbours\n", + " b = k_labels.count(1)\n", + " \n", + " pred[idx] = b/self.k\n", + " \n", + " return pred\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 3: Build and Evaluate the Model" + ] + 
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rubric:\n", + "* Reasonable accuracy values +10, +5\n", + "* Reasonable confidence intervals on the error estimate +10, +10\n", + "* Reasonable confusion matrix +5, +5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Preprocess the data files and partition the data." + ] + }, + { + "cell_type": "code", + "execution_count": 138, + "metadata": {}, + "outputs": [], + "source": [ + "# initialize the model\n", + "my_model = kNN()\n", + "# obtain features and labels from files\n", + "features, labels = preprocess('../data/madelon.data', '../data/madelon.labels')\n", + "# partition the data set\n", + "val_indices, test_indices, train_indices = partition(features.shape[0], t = 0.3, v = 0.1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assign a value to *k* and fit the *k*-NN model." + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "metadata": {}, + "outputs": [], + "source": [ + "# pass the training features and labels to the fit method\n", + "kwargs_f = {'metric': 'Euclidean'}\n", + "my_model.fit(features[train_indices], labels[train_indices], k=10, distance_f=distance, **kwargs_f)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Computing the confusion matrix for *k* = 10\n", + "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 140, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO\n", + "\n", + "# get model predictions\n", + "pred_ratios = my_model.predict(features[test_indices])\n", + "# For now, we will consider a data point as predicted in the positive class if at least 0.5 \n", + "# of its k-neighbors are positive.\n", + "threshold = 0.5\n", + "# convert predicted ratios to predicted labels\n", + "pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]\n", + "tp,tn, fp, fn = conf_matrix(labels[test_indices], pred_labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate." + ] + }, + { + "cell_type": "code", + "execution_count": 142, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy: 0.475\n", + "Confidence interval: 0.49110132745961876-0.5588986725403813\n" + ] + } + ], + "source": [ + "# TODO\n", + "# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell.\n", + "accuracy = (tp+tn)/len(test_indices)\n", + "error = 1 - accuracy\n", + "diff = 1.96 * np.sqrt((error * (1 - error)) / len(test_indices)) # z = 1.96 for a 95% interval\n", + "lower_bound = error - diff\n", + "upper_bound = error + diff\n", + "print('Accuracy: {}'.format(accuracy))\n", + "print('Confidence interval: {}-{}'.format(lower_bound, upper_bound))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " ## TASK 4: Plotting a learning curve\n", + " \n", + "A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n", + "We'll plot the error values for training and validation data while varying the size of the training set. 
Report a good size for training set for which there is a good balance between bias and variance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rubric:\n", + "* Correct training error calculation for different training set sizes +8, +8\n", + "* Correct validation error calculation for different training set sizes +8, +8\n", + "* Reasonable learning curve +4, +4" + ] + }, + { + "cell_type": "code", + "execution_count": 144, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "600 is a good training size\n" + ] + } + ], + "source": [ + "# try sizes 50, 100, 150, 200, ..., up to the largest multiple of 50 >= train_size\n", + "training_proportions = np.arange(0.10, 1.01, 0.10)\n", + "\n", + "# TODO\n", + "\n", + "# Calculate error for each entry in training_sizes\n", + "# for training and validation sets and populate\n", + "# error_train and error_val arrays. 
Each entry in these arrays\n", + "# should correspond to each entry in training_sizes.\n", + "\n", + "error_train = []\n", + "error_val = []\n", + "training_sizes = []\n", + "for proportion in training_proportions:\n", + " \n", + " size = len(train_indices)\n", + " size_avail = int(np.ceil(size*proportion))\n", + " training_sizes.append(size_avail)\n", + " idx_avail = train_indices[:size_avail]\n", + " \n", + " kwargs_f = {'metric': 'Euclidean'}\n", + " my_model.fit(features[idx_avail], labels[idx_avail], k = 10, distance_f=distance, **kwargs_f)\n", + " \n", + " val_pred_ratios = my_model.predict(features[val_indices])\n", + " val_pred_labels = [1 if x >= threshold else 0 for x in val_pred_ratios]\n", + " tp,tn, fp, fn = conf_matrix(labels[val_indices], val_pred_labels)\n", + " val_accuracy = (tp+tn)/len(val_indices)\n", + " val_error = 1 - val_accuracy\n", + " error_val.append(val_error)\n", + " \n", + " train_pred_ratios = my_model.predict(features[idx_avail])\n", + " train_pred_labels = [1 if x >= threshold else 0 for x in train_pred_ratios]\n", + " tp,tn, fp, fn = conf_matrix(labels[idx_avail], train_pred_labels)\n", + " train_accuracy = (tp+tn)/size_avail\n", + " train_error = 1 - train_accuracy\n", + " error_train.append(train_error)\n", + " \n", + "plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n", + "plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n", + "plt.legend()\n", + "plt.show()\n", + "print('{} is a good training size'.format(600))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 5: Determining *k*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rubric:\n", + "* Increased accuracy with new *k* +5, +5\n", + "* Improved confusion matrix +5, +5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n", + 
"\n", + "Below calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Hint: Try values both smaller and larger than 10." + ] + }, + { + "cell_type": "code", + "execution_count": 157, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.49666666666666665, 0.55, 0.545, 0.505, 0.495, 0.465, 0.445]" + ] + }, + "execution_count": 157, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# TODO\n", + "k_accuracies = []\n", + "# Change values of k. \n", + "for k in [1, 5, 10, 50, 100, 150, 200]:\n", + " # Calculate accuracies for the validation set.\n", + " my_model.fit(features[train_indices], labels[train_indices], k=k, distance_f=distance, **kwargs_f)\n", + " pred_ratios = my_model.predict(features[val_indices])\n", + " pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]\n", + " tp,tn, fp, fn = conf_matrix(labels[val_indices], pred_labels)\n", + " accuracy = (tp+tn)/len(val_indices)\n", + " k_accuracies.append(accuracy)\n", + "k_accuracies\n", + "# Report a good k value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 6: ROC curve analysis\n", + "* ROC curve has correct shape +20, +20" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ROC curve and confusion matrix for the final model\n", + "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for threshold values between 0.1 and 1.0." + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# TODO\n", + "# ROC curve\n", + "my_model.fit(features[train_indices], labels[train_indices], k=100, distance_f=distance, **kwargs_f)\n", + "pred_ratios = my_model.predict(features[test_indices])\n", + "\n", + "roc_sens, roc_spec_ = ROC(labels[test_indices], pred_ratios, np.arange(0.1, 1.0, 0.1))\n", + "plt.plot(roc_spec_, roc_sens)\n", + "plt.xlabel('1 - Specificity')\n", + "plt.ylabel('Sensitivity')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## TASK 7: Assess suitability of *k*-NN to your dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use this cell to write about your understanding of why *k*-NN performed well if it did or why not if it didn't." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ProgrammingAssignment_1/model.ipynb b/ProgrammingAssignment_1/model.ipynb index 11f5f312809f9c848e95f3af250be014be25b6bf..62a4e1553f3b72b3fa3db9b91ef742f7e2a9ed8a 100644 --- a/ProgrammingAssignment_1/model.ipynb +++ b/ProgrammingAssignment_1/model.ipynb @@ -18,16 +18,48 @@ "source": [ "# Supervised Learning Model Skeleton\n", "\n", - "We'll use this skeleton for implementing different supervised learning algorithms. Please complete \"preprocess\" and \"partition\" methods below." + "We'll use this skeleton for implementing different supervised learning algorithms." ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "class Model:\n", + " # preprocess_f and partition_f expect functions\n", + " # use kwargs to pass arguments to preprocessor_f and partition_f\n", + " # kwargs is a dictionary and should contain t, v, feature_file, label_file\n", + " # e.g. 
{'t': 0.3, 'v': 0.1, 'feature_file': 'some_file_name', 'label_file': 'some_file_name'}\n", + " \n", + " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", + " \n", + " self.features, self.labels = preprocessor_f(kwargs['feature_file'], kwargs['label_file'])\n", + " self.size = len(self.labels) # number of examples in dataset \n", + " self.feat_dim = self.features.shape[1] # number of features\n", + " \n", + " self.val_indices, self.test_indices = partition_f(self.size, kwargs['t'], kwargs['v'])\n", + " self.val_size = len(self.val_indices)\n", + " self.test_size = len(self.test_indices)\n", + " \n", + " self.train_indices = np.delete(np.arange(self.size), np.append(self.test_indices, self.val_indices), 0)\n", + " self.train_size = len(self.train_indices)\n", + " \n", + " def fit(self):\n", + " \n", + " raise NotImplementedError\n", + " \n", + " def predict(self, test_points):\n", + " raise NotImplementedError" + ] + }, + { + "cell_type": "code", + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "#TODO: GIVE!\n", "def preprocess(feature_file, label_file):\n", " '''\n", " Args:\n", @@ -41,13 +73,10 @@ " labels: ndarray\n", " nx1 labels\n", " '''\n", - " # You might find np.genfromtxt useful for reading in the file. Be careful with the file delimiter, \n", - " # e.g. 
for comma-separated files use delimiter=',' argument.\n", - " \n", - " # TODO \n", " \n", - " raise NotImplementedError\n", - "\n", + " # read in features and labels\n", + " features = np.genfromtxt(feature_file)\n", + " labels = np.genfromtxt(label_file)\n", " \n", " return features, labels" ] @@ -58,7 +87,6 @@ "metadata": {}, "outputs": [], "source": [ - "#TODO: GIVE!\n", "def partition(size, t, v = 0):\n", " '''\n", " Args:\n", @@ -74,16 +102,19 @@ " val_indices: ndarray\n", " 1D array containing validation set indices\n", " '''\n", + " \n", + " # number of test and validation examples\n", + " t_size = int(np.ceil(size*t))\n", + " v_size = int(np.ceil(size*v))\n", + "\n", + " # shuffle the indices\n", + " permuted = np.random.permutation(size)\n", " \n", - " # np.random.permutation might come in handy. Do not sample with replacement!\n", - " # Be sure not to use the same indices in test and validation sets!\n", - " \n", - " # use the first np.ceil(size*t) for test, \n", - " # the following np.ceil(size*v) for validation set.\n", - " \n", - " # TODO\n", + " # reserve the first t_size for test\n", + " test_indices = permuted[:t_size]\n", + " # and the next v_size for validation\n", + " val_indices = permuted[t_size:t_size+v_size]\n", " \n", - " raise NotImplementedError\n", " \n", " return test_indices, val_indices" ] @@ -92,51 +123,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In cases, where data is not abundantly available, we resort to getting an error estimate from average of error on different splits of error. In this case, every fold of data is used for testing and for training in turns, i.e. assuming we split our data into 3 folds, we'd\n", - "* train our model on fold-1+fold-2 and test on fold-3\n", - "* train our model on fold-1+fold-3 and test on fold-2\n", - "* train our model on fold-2+fold-3 and test on fold-1.\n", - "\n", - "We'd use the average of the error we obtained in three runs as our error estimate. 
Implement function \"kfold\" below.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO: Programming Assignment 2\n", - "\n", - "def kfold(indices, k):\n", - "\n", - " '''\n", - " Args:\n", - " indices: ndarray\n", - " 1D array with integer entries containing indices\n", - " k: int \n", - " Number of desired splits in data.(Assume test set is already separated.)\n", - " Returns:\n", - " fold_dict: dict\n", - " A dictionary with integer keys corresponding to folds. Values are (training_indices, val_indices).\n", - " \n", - " val_indices: ndarray\n", - " 1/k of training indices randomly chosen and separates them as validation partition.\n", - " train_indices: ndarray\n", - " Remaining 1-(1/k) of the indices.\n", - " \n", - " e.g. fold_dict = {0: (train_0_indices, val_0_indices), \n", - " 1: (train_0_indices, val_0_indices), 2: (train_0_indices, val_0_indices)} for k = 3\n", - " '''\n", - " \n", - " return fold_dict" + "## TASK 1: Implement `distance` function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value." + "\"distance\" function will be used in calculating cost of *k*-NN. It should take two data points and the name of the metric and return a scalar value." 
] }, { @@ -154,13 +148,13 @@ " y: ndarray\n", " 1D array containing coordinates for a point\n", " metric: str\n", - " Euclidean, Hamming \n", + " Euclidean, Manhattan \n", " Returns:\n", - " \n", + " dist: float\n", " '''\n", " if metric == 'Euclidean':\n", " raise NotImplementedError\n", - " elif metric == 'Hammming':\n", + " elif metric == 'Manhattan':\n", " raise NotImplementedError\n", " else:\n", " raise ValueError('{} is not a valid metric.'.format(metric))\n", @@ -171,48 +165,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will extend \"Model\" class while implementing supervised learning algorithms." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "class Model:\n", - " # preprocess_f and partition_f expect functions\n", - " # use kwargs to pass arguments to preprocessor_f and partition_f\n", - " # kwargs is a dictionary and should contain t, v, feature_file, label_file\n", - " # e.g. {'t': 0.3, 'v': 0.1, 'feature_file': 'some_file_name', 'label_file': 'some_file_name'}\n", - " \n", - " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", - " \n", - " self.features, self.labels = preprocessor_f(kwargs['feature_file'], kwargs['label_file'])\n", - " self.size = len(self.labels) # number of examples in dataset \n", - " self.feat_dim = self.features.shape[1] # number of features\n", - " \n", - " self.val_indices, self.test_indices = partition_f(self.size, kwargs['t'], kwargs['v'])\n", - " self.val_size = len(self.val_indices)\n", - " self.test_size = len(self.test_indices)\n", - " \n", - " self.train_indices = np.delete(np.arange(self.size), np.append(self.test_indices, self.val_indices), 0)\n", - " self.train_size = len(self.train_indices)\n", - " \n", - " def fit(self):\n", - " \n", - " raise NotImplementedError\n", - " \n", - " def predict(self, test_points):\n", - " raise NotImplementedError" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 
General supervised learning performance related functions \n", - "### (To be implemented later when it is indicated in other notebooks)" + "## General supervised learning performance related functions " ] }, { diff --git a/ProgrammingAssignment_1/model_solution.ipynb b/ProgrammingAssignment_1/model_solution.ipynb index 8e1f78b9b8ba4c77fe134126e7c004459a988f82..a4582f3c30f305943118e6756d059094d013902c 100644 --- a/ProgrammingAssignment_1/model_solution.ipynb +++ b/ProgrammingAssignment_1/model_solution.ipynb @@ -18,12 +18,28 @@ "source": [ "# Supervised Learning Model Skeleton\n", "\n", - "We'll use this skeleton for implementing different supervised learning algorithms. Please complete \"preprocess\" and \"partition\" methods below." + "We'll use this skeleton for implementing different supervised learning algorithms." ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "class Model:\n", + " \n", + " def fit(self):\n", + " \n", + " raise NotImplementedError\n", + " \n", + " def predict(self, test_points):\n", + " raise NotImplementedError" + ] + }, + { + "cell_type": "code", + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -40,13 +56,10 @@ " labels: ndarray\n", " nx1 labels\n", " '''\n", - " # You might find np.genfromtxt useful for reading in the file. Be careful with the file delimiter, \n", - " # e.g. for comma-separated files use delimiter=',' argument.\n", - " \n", - " # TODO \n", " \n", - " raise NotImplementedError\n", - "\n", + " # read in features and labels\n", + " features = np.genfromtxt(feature_file)\n", + " labels = np.genfromtxt(label_file)\n", " \n", " return features, labels" ] @@ -57,7 +70,6 @@ "metadata": {}, "outputs": [], "source": [ - "#TODO: GIVE!\n", "def partition(size, t, v = 0):\n", " '''\n", " Args:\n", @@ -74,73 +86,39 @@ " 1D array containing validation set indices\n", " '''\n", " \n", - " # np.random.permutation might come in handy. 
Do not sample with replacement!\n", - " # Be sure not to use the same indices in test and validation sets!\n", - " \n", - " # use the first np.ceil(size*t) for test, \n", - " # the following np.ceil(size*v) for validation set.\n", - " \n", - " # TODO\n", + " # number of test and validation examples\n", + " t_size = int(np.ceil(size*t))\n", + " v_size = int(np.ceil(size*v))\n", + "\n", + " # shuffle the indices\n", + " permuted = np.random.permutation(size)\n", " \n", - " raise NotImplementedError\n", + " # spare the first t_size for test\n", + " test_indices = permuted[:t_size]\n", + " # and the next v_size for validation\n", + " val_indices = permuted[t_size:t_size+v_size]\n", + " train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)\n", " \n", - " return test_indices, val_indices" + " return test_indices, val_indices, train_indices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In cases, where data is not abundantly available, we resort to getting an error estimate from average of error on different splits of error. In this case, every fold of data is used for testing and for training in turns, i.e. assuming we split our data into 3 folds, we'd\n", - "* train our model on fold-1+fold-2 and test on fold-3\n", - "* train our model on fold-1+fold-3 and test on fold-2\n", - "* train our model on fold-2+fold-3 and test on fold-1.\n", - "\n", - "We'd use the average of the error we obtained in three runs as our error estimate. 
Implement function \"kfold\" below.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO: Programming Assignment 2\n", - "\n", - "def kfold(indices, k):\n", - "\n", - " '''\n", - " Args:\n", - " indices: ndarray\n", - " 1D array with integer entries containing indices\n", - " k: int \n", - " Number of desired splits in data.(Assume test set is already separated.)\n", - " Returns:\n", - " fold_dict: dict\n", - " A dictionary with integer keys corresponding to folds. Values are (training_indices, val_indices).\n", - " \n", - " val_indices: ndarray\n", - " 1/k of training indices randomly chosen and separates them as validation partition.\n", - " train_indices: ndarray\n", - " Remaining 1-(1/k) of the indices.\n", - " \n", - " e.g. fold_dict = {0: (train_0_indices, val_0_indices), \n", - " 1: (train_0_indices, val_0_indices), 2: (train_0_indices, val_0_indices)} for k = 3\n", - " '''\n", - " \n", - " return fold_dict" + "## TASK 1: Implement `distance` function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value." + "\"distance\" function will be used in calculating cost of *k*-NN. It should take two data points and the name of the metric and return a scalar value." 
] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -155,12 +133,12 @@ " metric: str\n", " Euclidean, Hamming \n", " Returns:\n", - " \n", + " dist: float\n", " '''\n", " if metric == 'Euclidean':\n", - " raise NotImplementedError\n", - " elif metric == 'Hammming':\n", - " raise NotImplementedError\n", + " dist = np.sqrt(np.sum(np.square((x-y))))\n", + " elif metric == 'Manhattan':\n", + " dist = np.sum(abs(x-y))\n", " else:\n", " raise ValueError('{} is not a valid metric.'.format(metric))\n", " return dist # scalar distance btw x and y" @@ -170,48 +148,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will extend \"Model\" class while implementing supervised learning algorithms." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "class Model:\n", - " # preprocess_f and partition_f expect functions\n", - " # use kwargs to pass arguments to preprocessor_f and partition_f\n", - " # kwargs is a dictionary and should contain t, v, feature_file, label_file\n", - " # e.g. 
{'t': 0.3, 'v': 0.1, 'feature_file': 'some_file_name', 'label_file': 'some_file_name'}\n", - " \n", - " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", - " \n", - " self.features, self.labels = preprocessor_f(kwargs['feature_file'], kwargs['label_file'])\n", - " self.size = len(self.labels) # number of examples in dataset \n", - " self.feat_dim = self.features.shape[1] # number of features\n", - " \n", - " self.val_indices, self.test_indices = partition_f(self.size, kwargs['t'], kwargs['v'])\n", - " self.val_size = len(self.val_indices)\n", - " self.test_size = len(self.test_indices)\n", - " \n", - " self.train_indices = np.delete(np.arange(self.size), np.append(self.test_indices, self.val_indices), 0)\n", - " self.train_size = len(self.train_indices)\n", - " \n", - " def fit(self):\n", - " \n", - " raise NotImplementedError\n", - " \n", - " def predict(self, test_points):\n", - " raise NotImplementedError" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## General supervised learning performance related functions \n", - "### (To be implemented later when it is indicated in other notebooks)" + "## General supervised learning performance related functions " ] }, { @@ -223,7 +160,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -239,12 +176,28 @@ " Returns:\n", " ndarray\n", " '''\n", - " raise NotImplementedError\n", - " \n", + " \n", " tp = tn = fp = fn = 0\n", " # calculate true positives (tp), true negatives(tn)\n", " # false positives (fp) and false negatives (fn)\n", " \n", + " size = len(true)\n", + " for i in range(size):\n", + " if true[i]==1:\n", + " if pred[i] > 0:\n", + " \n", + " tp += 1\n", + " else:\n", + " \n", + " fn += 1\n", + " else:\n", + " if pred[i] == 0:\n", + " \n", + " tn += 1 \n", + " else:\n", + " \n", + " fp += 1 \n", + " \n", " # returns the confusion matrix as numpy.ndarray\n", " return np.array([tp,tn, fp, fn])" ] @@ 
-258,18 +211,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# TODO: Programming Assignment 1\n", "\n", - "def ROC(model, indices, value_list):\n", + "def ROC(true_labels, preds, value_list):\n", " '''\n", " Args:\n", - " model: a fitted supervised learning model\n", - " indices: ndarray\n", - " 1D array containing indices\n", + " true_labels: ndarray\n", + " 1D array containing true labels\n", + " preds: ndarray\n", + " 1D array containing the values to be thresholded (e.g. proportion of positive neighbors in kNN)\n", " value_list: ndarray\n", " 1D array containing different threshold values\n", " Returns:\n", @@ -283,10 +237,17 @@ " # use conf_matrix to calculate tp, tn, fp, fn\n", " # calculate sensitivity, 1-specificity\n", " # return two arrays\n", - " \n", - " raise NotImplementedError\n", - " \n", - " return sens, spec_" + " sens = []\n", + " spec_ = []\n", + " for threshold in value_list:\n", + " pred_labels = [1 if x >= threshold else 0 for x in preds]\n", + " tp,tn, fp, fn = conf_matrix(true_labels, pred_labels) \n", + " se = tp/(tp+fn)\n", + " sens.append(se)\n", + " spec = tn/(tn+fp)\n", + " spec_.append(1 - spec)\n", + " \n", + " return np.array(sens), np.array(spec_)" ] } ],