{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# *k*-Nearest Neighbor\n", "\n", "We'll implement *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n", "* dependent variable should be binary (suited for binary classification)\n", "* number of features (attributes) should be at least 50\n", "* number of examples (instances) should be between 1,000 - 5,000\n", "\n", "A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods. \n", "\n", "### Assignment Goals:\n", "In this assignment, we will:\n", "* learn to split a dataset into training/validation/test partitions \n", "* use the validation dataset to find a good value for *k*\n", "* Having found the \"best\" *k*, we'll obtain final performance measures:\n", " * accuracy, generalization error and ROC curve\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# GRADING\n", "\n", "You will be graded on parts that are marked with **\\#TODO** comments. Read the comments in the code to make sure you don't miss any.\n", "\n", "### Mandatory for 478 & 878:\n", "\n", "| | Tasks | 478 | 878 |\n", "|---|----------------------------|-----|-----|\n", "| 1 | Implement `distance` | 10 | 10 |\n", "| 2 | Implement `k-NN` methods | 25 | 20 |\n", "| 3 | Model evaluation | 25 | 20 |\n", "| 4 | Learning curve | 20 | 20 |\n", "| 6 | ROC curve analysis | 20 | 20 |\n", "\n", "### Mandatory for 878, bonus for 478\n", "\n", "| | Tasks | 478 | 878 |\n", "|---|----------------|-----|-----|\n", "| 5 | Optimizing *k* | 10 | 10 |\n", "\n", "\n", "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 10 bonus points for 478 students." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following code makes the Model class and relevant functions available from model.ipynb." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run 'model.ipynb'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 1: Implement `distance` function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Euclidean +5, +5\n", "* Hamming +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test `distance`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array(range(100))\n", "y = np.array(range(100, 200))\n", "dist_euclidean = distance(x, y, 'Euclidean')\n", "dist_hamming = distance(x, y, 'Hamming')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 2: Implement $k$-NN Class Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* correct implementation of fit method +5, +5\n", "* correct implementation of predict method +20, +15" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class kNN(Model):\n", " '''\n", " Inherits Model class. Implements the k-NN algorithm for classification.\n", " '''\n", " \n", " def fit(self, k, distance_f, **kwargs):\n", " '''\n", " Fit the model. This is pretty straightforward for k-NN.\n", " '''\n", " # TODO\n", " # set self.k, self.distance_f, self.distance_metric\n", " raise NotImplementedError\n", " \n", " return\n", " \n", " \n", " def predict(self, test_indices):\n", " \n", " raise NotImplementedError\n", " \n", " pred = []\n", " # TODO\n", " \n", " # for each point in test points\n", " # use your implementation of distance function\n", " # distance_f(..., distance_metric)\n", " # to find the labels of k-nearest neighbors. \n", " \n", " # Find the ratio of the positive labels\n", " # and append to pred with pred.append(ratio).\n", " \n", "\n", " return np.array(pred)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 3: Build and Evaluate the Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Reasonable accuracy values +10, +5\n", "* Reasonable confidence intervals on the error estimate +10, +10\n", "* Reasonable confusion matrix +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember you need to provide values to *t*, *v* parameters for \"partition\" function and to *feature_file* and *label_file* for \"preprocess\" function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# populate the keyword arguments dictionary kwargs\n", "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n", "# initialize the model\n", "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign a value to *k* and fit the *k*-NN model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kwargs_f = {'metric': 'Euclidean'}\n", "my_model.fit(k = 10, distance_f=distance, **kwargs_f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate your model on the test data and report your **accuracy**. 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final_labels = my_model.predict(my_model.test_indices)\n", "\n", "# For now, we will consider a data point as predicted in the positive class if more than 0.5\n", "# of its k nearest neighbors are positive.\n", "threshold = 0.5\n", "\n", "# TODO\n", "# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the confusion matrix for *k* = 10\n", "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 4: Plotting a learning curve\n", "\n", "A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n", "We'll plot the error values for the training and validation data while varying the size of the training set. Report a training set size that gives a good balance between bias and variance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Correct training error calculation for different training set sizes +8, +8\n", "* Correct validation error calculation for different training set sizes +8, +8\n", "* Reasonable learning curve +4, +4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try sizes 50, 100, 150, ..., up to the largest multiple of 50 that is <= train_size\n", "training_sizes = np.arange(50, my_model.train_size + 1, 50)\n", "\n", "# TODO\n", "\n", "# Calculate the error on the training and validation sets for each entry\n", "# in training_sizes and populate the error_train and error_val arrays.\n", "# Each entry in these arrays should correspond to the matching entry in training_sizes.\n", "\n", "plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n", "plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 5: Determining *k*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Increased accuracy with new *k* +5, +5\n", "* Improved confusion matrix +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n", "\n", "Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Report the confusion matrix for the new value of *k*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "\n", "# Change the values of k.\n", "# Calculate accuracies for the validation set.\n", "# Report a good k value.\n", "# Calculate the confusion matrix for the new k."
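] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below is a minimal sketch of one possible selection loop, for reference only. It assumes your partition exposes the validation indices as a `val_indices` attribute (a hypothetical name; adapt it to your Model implementation), that labels are encoded as 0/1, and that \"predict\" returns positive-neighbor ratios. You might call it as, for example, `select_k_sketch(my_model, range(1, 32, 2), kwargs_f)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of choosing k by validation accuracy; adapt attribute names to your Model class.\n", "def select_k_sketch(model, candidate_ks, fit_kwargs, threshold=0.5):\n", "    '''Return (best_k, best_accuracy) over candidate_ks, scored on the validation set.'''\n", "    val_true = model.labels[model.val_indices]         # assumed attribute name\n", "    best_k, best_acc = None, -1.0\n", "    for k in candidate_ks:\n", "        model.fit(k=k, distance_f=distance, **fit_kwargs)\n", "        ratios = model.predict(model.val_indices)      # positive-neighbor ratios\n", "        val_pred = (ratios > threshold).astype(int)    # thresholded class predictions\n", "        acc = np.mean(val_pred == val_true)            # validation accuracy\n", "        if acc > best_acc:\n", "            best_k, best_acc = k, acc\n", "    return best_k, best_acc"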
\n", "# Calculate accuracies for the validation set.\n", "# Report a good k value.\n", "# Calculate the confusion matrix for new k." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 6: ROC curve analysis\n", "* Correct implementation +20, +20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ROC curve and confusion matrix for the final model\n", "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ROC curve\n", "roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))\n", "plt.plot(roc_sens, roc_spec_)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }