{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# *k*-Nearest Neighbor\n",
"\n",
"We'll implement *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n",
"* dependent variable should be binary (suited for binary classification)\n",
"* number of features (attributes) should be at least 50\n",
"* number of examples (instances) should be between 1,000 - 5,000\n",
"\n",
"A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods. \n",
"\n",
"### Assignment Goals:\n",
"In this assignment, we will:\n",
"* learn to split a dataset into training/validation/test partitions \n",
"* use the validation dataset to find a good value for *k*\n",
"* Having found the \"best\" *k*, we'll obtain final performance measures:\n",
" * accuracy, generalization error and ROC curve\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GRADING\n",
"\n",
"You will be graded on parts that are marked with **\\#TODO** comments. Read the comments in the code to make sure you don't miss any.\n",
"\n",
"### Mandatory for 478 & 878:\n",
"\n",
"| | Tasks | 478 | 878 |\n",
"|---|----------------------------|-----|-----|\n",
"| 1 | Implement `distance` | 10 | 10 |\n",
"| 2 | Implement `k-NN` methods | 25 | 20 |\n",
"| 3 | Model evaluation | 25 | 20 |\n",
"| 4 | Learning curve | 20 | 20 |\n",
"| 6 | ROC curve analysis | 20 | 20 |\n",
"\n",
"### Mandatory for 878, bonus for 478\n",
"\n",
"| | Tasks | 478 | 878 |\n",
"|---|----------------|-----|-----|\n",
"| 5 | Optimizing *k* | 10 | 10 |\n",
"\n",
"\n",
"Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 10 bonus points for 478 students."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Following code makes the Model class and relevant functions available from model.ipynb."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%run 'model.ipynb'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TASK 1: Implement `distance` function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value."
]
},
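{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, below is a minimal sketch of what the two metrics could compute with numpy. It is only an illustration of the expected behavior (the function name `distance_sketch` is hypothetical); your graded implementation belongs in \"distance\" in **model.ipynb**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only -- not the graded implementation in model.ipynb.\n",
"# Assumes x and y are 1-D numpy arrays of equal length.\n",
"def distance_sketch(x, y, metric):\n",
"    if metric == 'Euclidean':\n",
"        # square root of the sum of squared coordinate-wise differences\n",
"        return np.sqrt(np.sum((x - y) ** 2))\n",
"    elif metric == 'Hamming':\n",
"        # number of positions where the two vectors differ (not normalized here)\n",
"        return np.sum(x != y)\n",
"    else:\n",
"        raise ValueError('Unknown metric: ' + metric)"
]
},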
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rubric:\n",
"* Euclidean +5, +5\n",
"* Hamming +5, +5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test `distance`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.array(range(100))\n",
"y = np.array(range(100, 200))\n",
"dist_euclidean = distance(x, y, 'Euclidean')\n",
"dist_hamming = distance(x, y, 'Hamming')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TASK 2: Implement $k$-NN Class Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rubric:\n",
"* correct implementation of fit method +5, +5\n",
"* correct implementation of predict method +20, +15"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class kNN(Model):\n",
" '''\n",
" Inherits Model class. Implements the k-NN algorithm for classification.\n",
" '''\n",
" \n",
" def fit(self, k, distance_f, **kwargs):\n",
" '''\n",
" Fit the model. This is pretty straightforward for k-NN.\n",
" '''\n",
" # TODO\n",
" # set self.k, self.distance_f, self.distance_metric\n",
" raise NotImplementedError\n",
" \n",
" return\n",
" \n",
" \n",
" def predict(self, test_indices):\n",
" \n",
" raise NotImplementedError\n",
" \n",
" pred = []\n",
" # TODO\n",
" \n",
" # for each point in test points\n",
" # use your implementation of distance function\n",
" # distance_f(..., distance_metric)\n",
" # to find the labels of k-nearest neighbors. \n",
" \n",
" # Find the ratio of the positive labels\n",
" # and append to pred with pred.append(ratio).\n",
" \n",
"\n",
" return np.array(pred)\n",
" "
]
},
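{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the neighbor lookup inside \"predict\", here is a hedged sketch. The attribute names `self.features` and `self.training_indices` are hypothetical (use whatever your Model class in model.ipynb actually exposes), and this is an illustration of the idea, not the graded solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of the core k-NN prediction loop -- NOT the graded solution.\n",
"# Hypothetical attributes: self.features (n x d feature array) and\n",
"# self.training_indices (indices of the training examples). self.k, self.distance_f\n",
"# and self.distance_metric follow the fit skeleton above; self.labels is assumed\n",
"# to be the array of 0/1 class labels.\n",
"def predict_sketch(self, test_indices):\n",
"    pred = []\n",
"    for i in test_indices:\n",
"        # distance from test example i to every training example\n",
"        dists = np.array([self.distance_f(self.features[i], self.features[j], self.distance_metric)\n",
"                          for j in self.training_indices])\n",
"        # positions (within training_indices) of the k closest training examples\n",
"        nearest = np.argsort(dists)[:self.k]\n",
"        neighbor_labels = self.labels[np.asarray(self.training_indices)[nearest]]\n",
"        # ratio of positive labels among the k nearest neighbors\n",
"        pred.append(np.mean(neighbor_labels == 1))\n",
"    return np.array(pred)"
]
},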
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TASK 3: Build and Evaluate the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rubric:\n",
"* Reasonable accuracy values +10, +5\n",
"* Reasonable confidence intervals on the error estimate +10, +10\n",
"* Reasonable confusion matrix +5, +5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember you need to provide values to *t*, *v* parameters for \"partition\" function and to *feature_file* and *label_file* for \"preprocess\" function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# populate the keyword arguments dictionary kwargs\n",
"kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n",
"# initialize the model\n",
"my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assign a value to *k* and fit the *k*-NN model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kwargs_f = {'metric': 'Euclidean'}\n",
"my_model.fit(k = 10, distance_f=distance, **kwargs_f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate."
]
},
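{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to do this (a reminder, not a prescribed method): estimate the generalization error by the test error rate $\\hat{e}$ and, treating the test errors as i.i.d. Bernoulli outcomes, report an approximate 95% confidence interval\n",
"\n",
"$$\\hat{e} \\pm 1.96 \\sqrt{\\frac{\\hat{e}(1-\\hat{e})}{n}},$$\n",
"\n",
"where $n$ is the number of test examples (use the appropriate $z$ value if you choose a different confidence level)."
]
},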
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_labels = my_model.predict(my_model.test_indices)\n",
"\n",
"# For now, we will consider a data point as predicted in the positive class if more than 0.5 \n",
"# of its k-neighbors are positive.\n",
"threshold = 0.5\n",
"\n",
"# TODO\n",
"# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing the confusion matrix for *k* = 10\n",
"Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet."
]
},
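{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, one common layout for a binary confusion matrix is shown below (the row/column orientation is a convention, not a requirement; just be consistent and state which one your `conf_matrix` uses):\n",
"\n",
"|          | Predicted + | Predicted - |\n",
"|----------|-------------|-------------|\n",
"| Actual + | TP          | FN          |\n",
"| Actual - | FP          | TN          |"
]
},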
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO\n",
"conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ## TASK 4: Plotting a learning curve\n",
" \n",
"A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n",
"We'll plot the error values for training and validation data while varying the size of the training set. Report a good size for training set for which there is a good balance between bias and variance."
]
},
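{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the intended loop structure is below. The subset-\"training\" step is implementation-specific for *k*-NN (it only restricts the neighbor search), and `model.training_indices` is a hypothetical attribute name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of the learning-curve loop -- not the graded implementation.\n",
"# For k-NN, \"training\" on n examples just means restricting the neighbor search\n",
"# to the first n training examples; how you do that depends on your Model class.\n",
"def learning_curve_sketch(model, training_sizes):\n",
"    error_train, error_val = [], []\n",
"    for n in training_sizes:\n",
"        # 1. restrict the model to its first n training examples (e.g., a hypothetical\n",
"        #    model.training_indices[:n]), keeping k and the distance metric fixed\n",
"        # 2. append the misclassification rate on those n training examples to error_train\n",
"        # 3. append the misclassification rate on the validation set to error_val\n",
"        pass\n",
"    return np.array(error_train), np.array(error_val)"
]
},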
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rubric:\n",
"* Correct training error calculation for different training set sizes +8, +8\n",
"* Correct validation error calculation for different training set sizes +8, +8\n",
"* Reasonable learning curve +4, +4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# try sizes 50, 100, 150, 200, ..., up to the largest multiple of 50 >= train_size\n",
"training_sizes = np.arange(50, my_model.train_size + 1, 50)\n",
"\n",
"# TODO\n",
"\n",
"# Calculate error for each entry in training_sizes\n",
"# for training and validation sets and populate\n",
"# error_train and error_val arrays. Each entry in these arrays\n",
"# should correspond to each entry in training_sizes.\n",
"\n",
"plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n",
"plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TASK 5: Determining *k*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rubric:\n",
"* Increased accuracy with new *k* +5, +5\n",
"* Improved confusion matrix +5, +5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n",
"\n",
"Below calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Report confusion matrix for the new value of *k*."
]
},
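{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible shape for this search is sketched below. `model.val_indices` is a hypothetical attribute name for the validation partition, and the 0.5 threshold mirrors the evaluation above; adapt both to your Model class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of a validation-set search over k -- not the graded implementation.\n",
"# Assumes model.val_indices holds the validation example indices (hypothetical name)\n",
"# and that predict returns positive-class ratios, thresholded at 0.5 as above.\n",
"def validation_accuracy_sketch(model, k, threshold=0.5):\n",
"    model.fit(k=k, distance_f=distance, metric='Euclidean')\n",
"    ratios = model.predict(model.val_indices)\n",
"    predicted = (ratios > threshold).astype(int)\n",
"    return np.mean(predicted == model.labels[model.val_indices])\n",
"\n",
"# e.g. accuracies = {k: validation_accuracy_sketch(my_model, k) for k in range(1, 31, 2)}"
]
},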
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# TODO\n",
"\n",
"# Change values of k. \n",
"# Calculate accuracies for the validation set.\n",
"# Report a good k value.\n",
"# Calculate the confusion matrix for new k."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TASK 6: ROC curve analysis\n",
"* Correct implementation +20, +20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC curve and confusion matrix for the final model\n",
"ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities."
]
},
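{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder, in terms of confusion-matrix counts,\n",
"\n",
"$$\\text{sensitivity} = \\frac{TP}{TP + FN}, \\qquad \\text{specificity} = \\frac{TN}{TN + FP},$$\n",
"\n",
"so each threshold value contributes one $(1-\\text{specificity},\\; \\text{sensitivity})$ point to the ROC curve."
]
},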
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ROC curve\n",
"roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))\n",
"plt.plot(roc_sens, roc_spec_)\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}