{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# *k*-Nearest Neighbor\n", "\n", "We'll implement *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n", "* dependent variable should be binary (suited for binary classification)\n", "* number of features (attributes) should be at least 50\n", "* number of examples (instances) should be between 1,000 - 5,000\n", "\n", "A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods. \n", "\n", "### Assignment Goals:\n", "In this assignment, we will:\n", "* learn to split a dataset into training/validation/test partitions \n", "* use the validation dataset to find a good value for *k*\n", "* Having found the \"best\" *k*, we'll obtain final performance measures:\n", " * accuracy, generalization error and ROC curve\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# GRADING\n", "\n", "You will be graded on parts that are marked with **\\#TODO** comments. Read the comments in the code to make sure you don't miss any.\n", "\n", "### Mandatory for 478 & 878:\n", "\n", "| | Tasks | 478 | 878 |\n", "|---|----------------------------|-----|-----|\n", "| 1 | Implement `distance` | 10 | 10 |\n", "| 2 | Implement `k-NN` methods | 25 | 20 |\n", "| 3 | Model evaluation | 25 | 20 |\n", "| 4 | Learning curve | 20 | 20 |\n", "| 6 | ROC curve analysis | 20 | 20 |\n", "\n", "### Mandatory for 878, bonus for 478\n", "\n", "| | Tasks | 478 | 878 |\n", "|---|----------------|-----|-----|\n", "| 5 | Optimizing *k* | 10 | 10 |\n", "\n", "\n", "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 100 points in this assignment and extra 10 bonus points for 478 students." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following code makes the Model class and relevant functions available from model.ipynb." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run 'model.ipynb'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 1: Implement `distance` function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Euclidean +5, +5\n", "* Hamming +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test `distance`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array(range(100))\n", "y = np.array(range(100, 200))\n", "dist_euclidean = distance(x, y, 'Euclidean')\n", "dist_hamming = distance(x, y, 'Hamming')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 2: Implement $k$-NN Class Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can start implementing our *k*-NN classifier. *k*-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes *k* as an argument. \"predict\" takes as input an *mxd* array containing *d*-dimensional *m* feature vectors for examples and outputs the predicted class and the ratio of positive examples in *k* nearest neighbors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* correct implementation of fit method +5, +5\n", "* correct implementation of predict method +20, +15" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class kNN(Model):\n", " '''\n", " Inherits Model class. Implements the k-NN algorithm for classification.\n", " '''\n", " \n", " def fit(self, k, distance_f, **kwargs):\n", " '''\n", " Fit the model. This is pretty straightforward for k-NN.\n", " '''\n", " # TODO\n", " # set self.k, self.distance_f, self.distance_metric\n", " raise NotImplementedError\n", " \n", " return\n", " \n", " \n", " def predict(self, test_indices):\n", " \n", " raise NotImplementedError\n", " \n", " pred = []\n", " # TODO\n", " \n", " # for each point in test points\n", " # use your implementation of distance function\n", " # distance_f(..., distance_metric)\n", " # to find the labels of k-nearest neighbors. \n", " \n", " # Find the ratio of the positive labels\n", " # and append to pred with pred.append(ratio).\n", " \n", "\n", " return np.array(pred)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 3: Build and Evaluate the Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Reasonable accuracy values +10, +5\n", "* Reasonable confidence intervals on the error estimate +10, +10\n", "* Reasonable confusion matrix +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember you need to provide values to *t*, *v* parameters for \"partition\" function and to *feature_file* and *label_file* for \"preprocess\" function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# populate the keyword arguments dictionary kwargs\n", "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n", "# initialize the model\n", "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign a value to *k* and fit the *k*-NN model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kwargs_f = {'metric': 'Euclidean'}\n", "my_model.fit(k = 10, distance_f=distance, **kwargs_f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate your model on the test data and report your **accuracy**. 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final_labels = my_model.predict(my_model.test_indices)\n", "\n", "# For now, we will consider a data point as predicted in the positive class if more than 0.5\n", "# of its k nearest neighbors are positive.\n", "threshold = 0.5\n", "\n", "# TODO\n", "# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the confusion matrix for *k* = 10\n", "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 4: Plotting a learning curve\n", "\n", "A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n", "We'll plot the error values for the training and validation data while varying the size of the training set. Report a training set size that gives a good balance between bias and variance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Correct training error calculation for different training set sizes +8, +8\n", "* Correct validation error calculation for different training set sizes +8, +8\n", "* Reasonable learning curve +4, +4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try sizes 50, 100, 150, ..., up to the largest multiple of 50 that is <= train_size\n", "training_sizes = np.arange(50, my_model.train_size + 1, 50)\n", "\n", "# TODO\n", "\n", "# Calculate the error on the training and validation sets for each entry\n", "# in training_sizes and populate the error_train and error_val arrays.\n", "# Each entry in these arrays should correspond to the matching entry in training_sizes.\n", "\n", "plt.plot(training_sizes, error_train, 'r', label = 'training_error')\n", "plt.plot(training_sizes, error_val, 'g', label = 'validation_error')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 5: Determining *k*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rubric:\n", "* Increased accuracy with new *k* +5, +5\n", "* Improved confusion matrix +5, +5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n", "\n", "Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Report the confusion matrix for the new value of *k*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "\n", "# Change the values of k.\n", "# Calculate accuracies for the validation set.\n", "# Report a good k value.\n", "# Calculate the confusion matrix for the new k."
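] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below is a minimal sketch of one possible selection loop, for reference only. It assumes your partition exposes the validation indices as a `val_indices` attribute (a hypothetical name; adapt it to your Model implementation), that labels are encoded as 0/1, and that \"predict\" returns positive-neighbor ratios. You might call it as, for example, `select_k_sketch(my_model, range(1, 32, 2), kwargs_f)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of choosing k by validation accuracy; adapt attribute names to your Model class.\n", "def select_k_sketch(model, candidate_ks, fit_kwargs, threshold=0.5):\n", "    '''Return (best_k, best_accuracy) over candidate_ks, scored on the validation set.'''\n", "    val_true = model.labels[model.val_indices]         # assumed attribute name\n", "    best_k, best_acc = None, -1.0\n", "    for k in candidate_ks:\n", "        model.fit(k=k, distance_f=distance, **fit_kwargs)\n", "        ratios = model.predict(model.val_indices)      # positive-neighbor ratios\n", "        val_pred = (ratios > threshold).astype(int)    # thresholded class predictions\n", "        acc = np.mean(val_pred == val_true)            # validation accuracy\n", "        if acc > best_acc:\n", "            best_k, best_acc = k, acc\n", "    return best_k, best_acc"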
\n", "# Calculate accuracies for the validation set.\n", "# Report a good k value.\n", "# Calculate the confusion matrix for new k." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TASK 6: ROC curve analysis\n", "* Correct implementation +20, +20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ROC curve and confusion matrix for the final model\n", "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different *threshold* parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ROC curve\n", "roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))\n", "plt.plot(roc_sens, roc_spec_)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }