From 88f10a757294d9c34a0546e96f1cf343bb66cfe0 Mon Sep 17 00:00:00 2001 From: Zeynep Hakguder <zhakguder@cse.unl.edu> Date: Fri, 1 Jun 2018 15:20:07 -0500 Subject: [PATCH] clean --- .../ProgrammingAssignment1-Solution.ipynb | 279 --------- .../ProgrammingAssignment1_solution.ipynb | 581 ------------------ ProgrammingAssignment_1/model_solution.ipynb | 274 --------- 3 files changed, 1134 deletions(-) delete mode 100644 ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb delete mode 100644 ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb delete mode 100644 ProgrammingAssignment_1/model_solution.ipynb diff --git a/ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb b/ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb deleted file mode 100644 index c279662..0000000 --- a/ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb +++ /dev/null @@ -1,279 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# k-Nearest Neighbor" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can use numpy for array operations and matplpotlib for plotting for this assignment. Please do not add other libraries." - ] - }, - { - "cell_type": "code", - "execution_count": 247, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Following code makes the Model class and relevant functions available from model.ipynb." - ] - }, - { - "cell_type": "code", - "execution_count": 256, - "metadata": {}, - "outputs": [], - "source": [ - "%run 'model-Solution.ipynb'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Choice of distance metric plays an important role in the performance of kNN. Let's start by implementing a distance method in the \"distance\" function below. 
It should take two data points and the name of the metric and return a scalar value." - ] - }, - { - "cell_type": "code", - "execution_count": 257, - "metadata": {}, - "outputs": [], - "source": [ - "def distance(x, y, metric):\n", - " '''\n", - " x: a 1xd array\n", - " y: a 1xd array\n", - " metric: Euclidean, Hamming, etc.\n", - " '''\n", - " #raise NotImplementedError\n", - " \n", - " if metric == 'Euclidean':\n", - " dist = np.sqrt(np.sum(np.square((x-y))))\n", - " \n", - " ####################################\n", - " return dist # scalar distance btw x and y" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can implement our kNN classifier. kNN class inherits Model class. Implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes $k$ as an argument. \"predict\" takes as input the feature vector for a single test point and outputs the predicted class, and the proportion of predicted class labels in $k$ nearest neighbors." 
- ] - }, - { - "cell_type": "code", - "execution_count": 283, - "metadata": {}, - "outputs": [], - "source": [ - "class kNN(Model):\n", - "\n", - " def fit(self, k, distance_f, **kwargs):\n", - " \n", - " #raise NotImplementedError\n", - " \n", - " self.k = k\n", - " self.distance_f = distance_f\n", - " self.distance_metric = kwargs['metric']\n", - " \n", - " \n", - " #######################\n", - " return\n", - " # vary the threshold value for ROC analysis\n", - " def predict(self, test_points):\n", - " \n", - " chosen_labels = []\n", - " for test_point in self.features[test_indices]:\n", - " #raise NotImplementedError\n", - " tmp_dist = [np.inf] * self.k\n", - " distances = []\n", - "\n", - " labels = []\n", - " for index in self.training_indices:\n", - " dist = self.distance_f(self.features[index], test_point, self.distance_metric)\n", - " distances.append(dist)\n", - " labels.append(self.labels[index])\n", - " a_order = np.argsort(distances)\n", - " tmp_labels = list(np.array(labels)[a_order[::-1]][:self.k])\n", - " b = tmp_labels.count(1)\n", - " chosen_labels.append(b/self.k)\n", - " \n", - " ##########################\n", - " # return the predicted class label and the following ratio: \n", - " # number of points that have the same label as the test point / k\n", - " return np.array(chosen_labels)\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It's time to build and evaluate our model now. Remember you need to provide values to $p$, $v$ parameters for \"partition\" function and to $file\\_path$ for \"preprocess\" function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 284, - "metadata": {}, - "outputs": [], - "source": [ - "# populate the keyword arguments dictionary kwargs\n", - "kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}\n", - "# initialize the model\n", - "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Assign a value to $k$ and fit the kNN model. You do not need to change the value of the $threshold$ parameter yet." - ] - }, - { - "cell_type": "code", - "execution_count": 285, - "metadata": {}, - "outputs": [], - "source": [ - "kwargs_f = {'metric': 'Euclidean'}\n", - "my_model.fit(k = 10, distance_f=distance, **kwargs_f)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Evaluate your model on the test data and report your accuracy. Also, calculate and report the confidence interval on the generalization error estimate." - ] - }, - { - "cell_type": "code", - "execution_count": 286, - "metadata": {}, - "outputs": [], - "source": [ - "final_labels = my_model.predict(my_model.test_indices)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 289, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([196, 106, 193, 105])" - ] - }, - "execution_count": 289, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# You should see array([ 196, 106, 193, 105]) with seed 123\n", - "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold= 0.5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement a \"ROC\" function that predicts the labels of the test set examples using different $threshold$ values in \"fit\" and plot the ROC curve. \"ROC\" takes a list containing different $threshold$ parameter values to try and returns (sensitivity, 1-specificity) pair for each $parameter$ value." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def ROC(true, pred, value_list):\n", - " '''\n", - " true: nx1 array of true labels for test set\n", - " pred: nx1 array of predicted labels for test set\n", - " Calculate sensitivity and 1-specificity for each point in value_list\n", - " Return two nX1 arrays: sens (for sensitivities) and spec_ (for 1-specificities)\n", - " '''\n", - " \n", - " \n", - " return sens, spec_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can finally create the confusion matrix and plot the ROC curve for our kNN classifier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# confusion matrix\n", - "conf_matrix(true_classes, predicted_classes)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# ROC curve\n", - "roc_sens, roc_spec_ = ROC(true_classes, predicted_classes, np.arange(0.1, 1.0, 0.1))\n", - "plt.plot(roc_sens, roc_spec_)\n", - "plt.show()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb b/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb deleted file mode 100644 index 02836ab..0000000 --- a/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb +++ /dev/null @@ -1,581 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# *k*-Nearest Neighbor\n", - "\n", - "We'll implement *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n", - "* dependent variable should be binary (suited for binary classification)\n", - "* number of features (attributes) should be at least 50\n", - "* number of examples (instances) should be between 1,000 - 5,000\n", - "\n", - "A skeleton of a general supervised learning model is provided in \"model.ipynb\". The functions that will be implemented there will be indicated in this notebook. 
\n", - "\n", - "### Assignment Goals:\n", - "In this assignment, we will:\n", - "* we'll implement 'Euclidean' and 'Manhattan' distance metrics \n", - "* use the validation dataset to find a good value for *k*\n", - "* evaluate our model with respect to performance measures:\n", - " * accuracy, generalization error and ROC curve\n", - "* try to assess if *k*-NN is suitable for the dataset you used\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# GRADING\n", - "\n", - "You will be graded on parts that are marked with **\\#TODO** comments. Read the comments in the code to make sure you don't miss any.\n", - "\n", - "### Mandatory for 478 & 878:\n", - "\n", - "| | Tasks | 478 | 878 |\n", - "|---|----------------------------|-----|-----|\n", - "| 1 | Implement `distance` | 10 | 10 |\n", - "| 2 | Implement `k-NN` methods | 25 | 20 |\n", - "| 3 | Model evaluation | 25 | 20 |\n", - "| 4 | Learning curve | 20 | 20 |\n", - "| 6 | ROC curve analysis | 20 | 20 |\n", - "\n", - "### Mandatory for 878, bonus for 478\n", - "\n", - "| | Tasks | 478 | 878 |\n", - "|---|----------------|-----|-----|\n", - "| 5 | Optimizing *k* | 10 | 10 |\n", - "\n", - "### Bonus for 478/878\n", - "\n", - "| | Tasks | 478 | 878 |\n", - "|---|----------------|-----|-----|\n", - "| 7 | Assess suitability of *k*-NN | 10 | 10 |\n", - "\n", - "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 20 bonus points for 478 students and 10 bonus points for 878 students." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries." 
- ] - }, - { - "cell_type": "code", - "execution_count": 119, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Following code makes the Model class and relevant functions available from model.ipynb." - ] - }, - { - "cell_type": "code", - "execution_count": 134, - "metadata": {}, - "outputs": [], - "source": [ - "%run 'model_solution.ipynb'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 1: Implement `distance` function" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Rubric:\n", - "* Euclidean +5, +5\n", - "* Manhattan +5, +5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Test `distance`" - ] - }, - { - "cell_type": "code", - "execution_count": 136, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Euclidean distance: 1000.0, Manhattan distance: 10000\n" - ] - } - ], - "source": [ - "x = np.array(range(100))\n", - "y = np.array(range(100, 200))\n", - "dist_euclidean = distance(x, y, 'Euclidean')\n", - "dist_manhattan = distance(x, y, 'Manhattan')\n", - "print('Euclidean distance: {}, Manhattan distance: {}'.format(dist_euclidean, dist_manhattan))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 2: Implement $k$-NN Class Methods" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. 
Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Rubric:\n", - "* correct implementation of fit method +5, +5\n", - "* correct implementation of predict method +20, +15" - ] - }, - { - "cell_type": "code", - "execution_count": 137, - "metadata": {}, - "outputs": [], - "source": [ - "class kNN(Model):\n", - " '''\n", - " Inherits Model class. Implements the k-NN algorithm for classification.\n", - " '''\n", - " \n", - " def fit(self, training_features, training_labels, k, distance_f, **kwargs):\n", - " '''\n", - " Fit the model. This is pretty straightforward for k-NN.\n", - " Args:\n", - " training_features: ndarray\n", - " training_labels: ndarray\n", - " k: int\n", - " distance_f: function\n", - " kwargs: dict\n", - " Contains keyword arguments that will be passed to distance_f\n", - " '''\n", - " # TODO\n", - " # set self.train_features, self.train_labels,self.k, self.distance_f, self.distance_metric\n", - " self.train_features = training_features \n", - " self.train_labels = training_labels\n", - " self.k = k\n", - " self.distance_f = distance_f\n", - " self.distance_metric = kwargs['metric']\n", - " \n", - " return\n", - " \n", - " \n", - " def predict(self, test_features):\n", - " \n", - " test_size = len(test_features)\n", - " train_size = len(self.train_labels)\n", - " \n", - " pred = np.empty(len(test_features))\n", - " # TODO\n", - " # for each point in test points\n", - " for idx in range(test_size):\n", - " point = test_features[idx] \n", - " distances = []\n", - " labels = []\n", - " \n", - "\n", - " for tr_idx in range(train_size):\n", - " train_example = self.train_features[tr_idx]\n", - " train_label = self.train_labels[tr_idx]\n", - " 
dist = self.distance_f(point, train_example, metric = self.distance_metric)\n", - " distances.append(dist)\n", - " labels.append(train_label)\n", - " \n", - " # get the order of distances\n", - " dist_order = np.argsort(distances)\n", - " # get the labels of k points that are closest to test point\n", - " k_labels = list(np.array(labels)[dist_order[::-1]][:self.k])\n", - " \n", - " # get number of positive labels in k neighbours\n", - " b = k_labels.count(1)\n", - " \n", - " pred[idx] = b/self.k\n", - " \n", - " return pred\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 3: Build and Evaluate the Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Rubric:\n", - "* Reasonable accuracy values +10, +5\n", - "* Reasonable confidence intervals on the error estimate +10, +10\n", - "* Reasonable confusion matrix +5, +5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Preprocess the data files and partition the data." - ] - }, - { - "cell_type": "code", - "execution_count": 138, - "metadata": {}, - "outputs": [], - "source": [ - "# initialize the model\n", - "my_model = kNN()\n", - "# obtain features and labels from files\n", - "features, labels = preprocess('../data/madelon.data', '../data/madelon.labels')\n", - "# partition the data set\n", - "val_indices, test_indices, train_indices = partition(features.shape[0], t = 0.3, v = 0.1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Assign a value to *k* and fit the *k*-NN model." 
- ] - }, - { - "cell_type": "code", - "execution_count": 139, - "metadata": {}, - "outputs": [], - "source": [ - "# pass the training features and labels to the fit method\n", - "kwargs_f = {'metric': 'Euclidean'}\n", - "my_model.fit(features[train_indices], labels[train_indices], k=10, distance_f=distance, **kwargs_f)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Computing the confusion matrix for *k* = 10\n", - "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." - ] - }, - { - "cell_type": "code", - "execution_count": 140, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO\n", - "\n", - "# get model predictions\n", - "pred_ratios = my_model.predict(features[test_indices])\n", - "# For now, we will consider a data point as predicted in the positive class if more than 0.5 \n", - "# of its k-neighbors are positive.\n", - "threshold = 0.5\n", - "# convert predicted ratios to predicted labels\n", - "pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]\n", - "tp,tn, fp, fn = conf_matrix(labels[test_indices], pred_labels)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate." 
- ] - }, - { - "cell_type": "code", - "execution_count": 142, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Accuracy: 0.475\n", - "Confidence interval: 0.49110132745961876-0.5588986725403813\n" - ] - } - ], - "source": [ - "# TODO\n", - "# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell.\n", - "accuracy = (tp+tn)/len(test_indices)\n", - "error = 1 - accuracy\n", - "diff = 0.96 * np.sqrt((error * (1 - error)) / len(test_indices))\n", - "lower_bound = error - diff\n", - "upper_bound = error + diff\n", - "print('Accuracy: {}'.format(accuracy))\n", - "print('Confidence interval: {}-{}'.format(lower_bound, upper_bound))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " ## TASK 4: Plotting a learning curve\n", - " \n", - "A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n", - "We'll plot the error values for training and validation data while varying the size of the training set. Report a good size for training set for which there is a good balance between bias and variance." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Rubric:\n", - "* Correct training error calculation for different training set sizes +8, +8\n", - "* Correct validation error calculation for different training set sizes +8, +8\n", - "* Reasonable learning curve +4, +4" - ] - }, - { - "cell_type": "code", - "execution_count": 144, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 432x288 with 1 Axes>" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "600 is a good training size\n" - ] - } - ], - "source": [ - "# train using %10, %20, %30, ..., 100% of training data\n", - "training_proportions = np.arange(0.10, 1.01, 0.10)\n", - "\n", - "# TODO\n", - "\n", - "# Calculate error for each entry in training_sizes\n", - "# for training and validation sets and populate\n", - "# error_train and error_val arrays. Each entry in these arrays\n", - "# should correspond to each entry in training_sizes.\n", - "\n", - "error_train = []\n", - "error_val = []\n", - "training_sizes = []\n", - "for proportion in training_proportions:\n", - " \n", - " size = len(train_indices)\n", - " size_avail = np.int(np.ceil(size*proportion))\n", - " training_sizes.append(size_avail)\n", - " idx_avail = train_indices[:size_avail]\n", - " \n", - " kwargs_f = {'metric': 'Euclidean'}\n", - " my_model.fit(features[idx_avail], labels[idx_avail], k = 10, distance_f=distance, **kwargs_f)\n", - " \n", - " val_pred_ratios = my_model.predict(features[val_indices])\n", - " val_pred_labels = [1 if x >= threshold else 0 for x in val_pred_ratios]\n", - " tp,tn, fp, fn = conf_matrix(labels[val_indices], val_pred_labels)\n", - " val_accuracy = (tp+tn)/len(val_indices)\n", - " val_error = 1 - val_accuracy\n", - " error_val.append(val_error)\n", - " \n", - " train_pred_ratios = my_model.predict(features[train_indices])\n", - " train_pred_labels = [1 if x 
>= threshold else 0 for x in train_pred_ratios]\n", - " tp,tn, fp, fn = conf_matrix(labels[idx_avail], train_pred_labels)\n", - " train_accuracy = (tp+tn)/size_avail\n", - " train_error = 1 - train_accuracy\n", - " error_train.append(train_error)\n", - " \n", - "plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n", - "plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n", - "plt.legend()\n", - "plt.show()\n", - "print('{} is a good training size'.format(600))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 5: Determining *k*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Rubric:\n", - "* Increased accuracy with new *k* +5, +5\n", - "* Improved confusion matrix +5, +5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n", - "\n", - "Below calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Hint: Try values both smaller and larger than 10." - ] - }, - { - "cell_type": "code", - "execution_count": 157, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[0.49666666666666665, 0.55, 0.545, 0.505, 0.495, 0.465, 0.445]" - ] - }, - "execution_count": 157, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# TODO\n", - "k_accuracies = []\n", - "# Change values of k. 
\n", - "for k in [1, 5, 10, 50, 100, 150, 200]:\n", - " # Calculate accuracies for the validation set.\n", - " my_model.fit(features[train_indices], labels[train_indices], k=k, distance_f=distance, **kwargs_f)\n", - " pred_ratios = my_model.predict(features[val_indices])\n", - " pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]\n", - " tp,tn, fp, fn = conf_matrix(labels[val_indices], pred_labels)\n", - " accuracy = (tp+tn)/len(val_indices)\n", - " k_accuracies.append(accuracy)\n", - "k_accuracies\n", - "# Report a good k value." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 6: ROC curve analysis\n", - "* ROC curve has correct shape +20, +20" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### ROC curve and confusion matrix for the final model\n", - "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0." 
- ] - }, - { - "cell_type": "code", - "execution_count": 135, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "<Figure size 432x288 with 1 Axes>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# TODO\n", - "# ROC curve\n", - "my_model.fit(features[train_indices], labels[train_indices], k=100, distance_f=distance, **kwargs_f)\n", - "pred_ratios = my_model.predict(features[test_indices])\n", - "\n", - "roc_sens, roc_spec_ = ROC(labels[test_indices], pred_ratios, np.arange(0.1, 1.0, 0.1))\n", - "plt.plot(roc_sens, roc_spec_)\n", - "plt.xlabel('Sensitivity')\n", - "plt.ylabel('Specificity')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 7: Assess suitability of *k*-NN to your dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use this cell to write about your understanding of why *k*-NN performed well if it did or why not if it didn't. What properties of the dataset affect the performance of the algorithm?" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/ProgrammingAssignment_1/model_solution.ipynb b/ProgrammingAssignment_1/model_solution.ipynb deleted file mode 100644 index c218359..0000000 --- a/ProgrammingAssignment_1/model_solution.ipynb +++ /dev/null @@ -1,274 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# JUPYTER NOTEBOOK TIPS\n", - "\n", - "Each rectangular box is called a cell. 
\n", - "* Ctrl+ENTER evaluates the current cell; if it contains Python code, it runs the code, if it contains Markdown, it returns rendered text.\n", - "* Alt+ENTER evaluates the current cell and adds a new cell below it.\n", - "* If you click to the left of a cell, you'll notice the frame changes color to blue. You can erase a cell by hitting 'dd' (that's two \"d\"s in a row) when the frame is blue." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Supervised Learning Model Skeleton\n", - "\n", - "We'll use this skeleton for implementing different supervised learning algorithms." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "class Model:\n", - " \n", - " def fit(self):\n", - " \n", - " raise NotImplementedError\n", - " \n", - " def predict(self, test_points):\n", - " raise NotImplementedError" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "def preprocess(feature_file, label_file):\n", - " '''\n", - " Args:\n", - " feature_file: str \n", - " file containing features\n", - " label_file: str\n", - " file containing labels\n", - " Returns:\n", - " features: ndarray\n", - " nxd features\n", - " labels: ndarray\n", - " nx1 labels\n", - " '''\n", - " \n", - " # read in features and labels\n", - " features = np.genfromtxt(feature_file)\n", - " labels = np.genfromtxt(label_file)\n", - " \n", - " return features, labels" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "def partition(size, t, v = 0):\n", - " '''\n", - " Args:\n", - " size: int\n", - " number of examples in the whole dataset\n", - " t: float\n", - " proportion kept for test\n", - " v: float\n", - " proportion kept for validation\n", - " Returns:\n", - " test_indices: ndarray\n", - " 1D array containing test set indices\n", - " val_indices: ndarray\n", - " 1D array containing validation set 
indices\n", - " '''\n", - " \n", - " # number of test and validation examples\n", - " t_size = np.int(np.ceil(size*t))\n", - " v_size = np.int(np.ceil(size*v))\n", - "\n", - " # shuffle the indices\n", - " permuted = np.random.permutation(size)\n", - " \n", - " # spare the first t_size for test\n", - " test_indices = permuted[:t_size]\n", - " # and the next v_size for validation\n", - " val_indices = permuted[t_size+1:t_size+v_size+1]\n", - " train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)\n", - " \n", - " return test_indices, val_indices, train_indices" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TASK 1: Implement `distance` function" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\"distance\" function will be used in calculating cost of *k*-NN. It should take two data points and the name of the metric and return a scalar value." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "#TODO: Programming Assignment 1\n", - "def distance(x, y, metric):\n", - " '''\n", - " Args:\n", - " x: ndarray \n", - " 1D array containing coordinates for a point\n", - " y: ndarray\n", - " 1D array containing coordinates for a point\n", - " metric: str\n", - " Euclidean, Manhattan \n", - " Returns:\n", - " dist: float\n", - " '''\n", - " if metric == 'Euclidean':\n", - " dist = np.sqrt(np.sum(np.square((x-y))))\n", - " elif metric == 'Manhattan':\n", - " dist = np.sum(abs(x-y))\n", - " else:\n", - " raise ValueError('{} is not a valid metric.'.format(metric))\n", - " return dist # scalar distance btw x and y" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## General supervised learning performance related functions " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Implement the \"conf_matrix\" function that takes as input an array of true labels (*true*) and an array of 
predicted labels (*pred*). It should output a numpy.ndarray." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO: Programming Assignment 1\n", - "\n", - "def conf_matrix(true, pred):\n", - " '''\n", - " Args: \n", - " true: ndarray\n", - " nx1 array of true labels for test set\n", - " pred: ndarray \n", - " nx1 array of predicted labels for test set\n", - " Returns:\n", - " ndarray\n", - " '''\n", - " \n", - " tp = tn = fp = fn = 0\n", - " # calculate true positives (tp), true negatives(tn)\n", - " # false positives (fp) and false negatives (fn)\n", - " \n", - " size = len(true)\n", - " for i in range(size):\n", - " if true[i]==1:\n", - " if pred[i] > 0:\n", - " \n", - " tp += 1\n", - " else:\n", - " \n", - " fn += 1\n", - " else:\n", - " if pred[i] == 0:\n", - " \n", - " tn += 1 \n", - " else:\n", - " \n", - " fp += 1 \n", - " \n", - " # returns the confusion matrix as numpy.ndarray\n", - " return np.array([tp,tn, fp, fn])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO: Programming Assignment 1\n", - "\n", - "def ROC(true_labels, preds, value_list):\n", - " '''\n", - " Args:\n", - " true_labels: ndarray\n", - " 1D array containing true labels\n", - " preds: ndarray\n", - " 1D array containing thresholded value (e.g. 
proportion of positive neighbors in kNN)\n", - " value_list: ndarray\n", - " 1D array containing different threshold values\n", - " Returns:\n", - " sens: ndarray\n", - " 1D array containing sensitivities\n", - " spec_: ndarray\n", - " 1D array containing 1-specifities\n", - " '''\n", - " \n", - " # use conf_matrix to calculate tp, tn, fp, fn\n", - " # calculate sensitivity, 1-specificity\n", - " # return two arrays\n", - " sens = []\n", - " spec_ = []\n", - " for threshold in value_list:\n", - " pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]\n", - " tp,tn, fp, fn = conf_matrix(true_labels, pred_labels) \n", - " se = tp/(tp+fn)\n", - " sens.append(se)\n", - " spec = tn/(tn+fp)\n", - " spec_.append(1 - spec)\n", - " \n", - " return np.array(sens), np.array(spec_)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} -- GitLab
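
Reviewer note: for reference, the core of the removed solution notebooks (the `distance` function, the neighbor-ratio prediction, the confusion counts, and the error interval) can be condensed into the minimal sketch below. It is an illustration under stated assumptions, not the authors' code: the names `knn_positive_ratio`, `conf_counts`, and `error_interval` are hypothetical, only NumPy is used, and the positive class is assumed to be labelled 1. The sketch takes the k smallest distances, whereas the deleted `predict` methods reverse the `argsort` order before slicing, and it uses z = 1.96 for an approximate 95% interval, whereas the deleted evaluation cell multiplies by 0.96.

```python
import numpy as np


def distance(x, y, metric="Euclidean"):
    """Scalar distance between two 1D points (same two metrics as the removed notebook)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if metric == "Euclidean":
        return float(np.sqrt(np.sum(diff ** 2)))
    if metric == "Manhattan":
        return float(np.sum(np.abs(diff)))
    raise ValueError("{} is not a valid metric.".format(metric))


def knn_positive_ratio(train_x, train_y, test_x, k, metric="Euclidean"):
    """For each test point, the fraction of its k nearest training points labelled 1."""
    train_y = np.asarray(train_y)
    ratios = np.empty(len(test_x))
    for i, point in enumerate(test_x):
        dists = np.array([distance(point, t, metric) for t in train_x])
        nearest = np.argsort(dists)[:k]  # ascending sort: smallest distances first
        ratios[i] = np.mean(train_y[nearest] == 1)
    return ratios


def conf_counts(true, pred):
    """Return (tp, tn, fp, fn); any label other than 1 is treated as negative."""
    true_pos = np.asarray(true) == 1
    pred_pos = np.asarray(pred) == 1
    tp = int(np.sum(true_pos & pred_pos))
    tn = int(np.sum(~true_pos & ~pred_pos))
    fp = int(np.sum(~true_pos & pred_pos))
    fn = int(np.sum(true_pos & ~pred_pos))
    return tp, tn, fp, fn


def error_interval(error, n, z=1.96):
    """Normal-approximation interval for an error rate measured on n examples."""
    half = z * np.sqrt(error * (1.0 - error) / n)
    return error - half, error + half


# Toy usage (made-up data, not Madelon): two clusters, k = 2.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[0.0, 0.5], [5.5, 5.0]])
ratios = knn_positive_ratio(train_x, train_y, test_x, k=2)
pred_labels = (ratios >= 0.5).astype(int)  # same 0.5 threshold as the notebook
```

Thresholding the returned ratios at several cut-off values and calling `conf_counts` on each result would give the sensitivity and 1-specificity pairs that the removed `ROC` function collects.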