clean

88f10a75 · Zeynep Hakguder · 658639d5
Commit 88f10a75 authored 6 years ago by Zeynep Hakguder
--- a/ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb
+++ b/ProgrammingAssignment_1/ProgrammingAssignment1-Solution.ipynb
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# k-Nearest Neighbor"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 247,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "import matplotlib.pyplot as plt"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The following code makes the Model class and relevant functions available from model-Solution.ipynb."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 256,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%run 'model-Solution.ipynb'"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The choice of distance metric plays an important role in the performance of kNN. Let's start by implementing the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 257,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def distance(x, y, metric):\n",
-    "    '''\n",
-    "    x: a 1xd array\n",
-    "    y: a 1xd array\n",
-    "    metric: Euclidean, Hamming, etc.\n",
-    "    '''\n",
-    "    #raise NotImplementedError\n",
-    "    \n",
-    "    if metric == 'Euclidean':\n",
-    "        dist = np.sqrt(np.sum(np.square((x-y))))\n",
-    "    \n",
-    "    ####################################\n",
-    "    return dist # scalar distance btw x and y"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can now implement our kNN classifier. The kNN class inherits from the Model class. Implement the \"fit\" and \"predict\" methods, using the \"distance\" function you defined above. The \"fit\" method takes $k$ as an argument. The \"predict\" method takes as input the feature vector for a single test point and outputs the predicted class and the proportion of the predicted class label among the $k$ nearest neighbors."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 283,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "class kNN(Model):\n",
-    "\n",
-    "    def fit(self, k, distance_f, **kwargs):\n",
-    "        \n",
-    "        #raise NotImplementedError\n",
-    "        \n",
-    "        self.k = k\n",
-    "        self.distance_f = distance_f\n",
-    "        self.distance_metric = kwargs['metric']\n",
-    "        \n",
-    "                \n",
-    "        #######################\n",
-    "        return\n",
-    "    # vary the threshold value for ROC analysis\n",
-    "    def predict(self, test_points):\n",
-    "        \n",
-    "        chosen_labels = []\n",
-    "        for test_point in self.features[test_points]:\n",
-    "            #raise NotImplementedError\n",
-    "            tmp_dist = [np.inf] * self.k\n",
-    "            distances = []\n",
-    "\n",
-    "            labels = []\n",
-    "            for index in self.training_indices:\n",
-    "                dist = self.distance_f(self.features[index], test_point, self.distance_metric)\n",
-    "                distances.append(dist)\n",
-    "                labels.append(self.labels[index])\n",
-    "            a_order = np.argsort(distances)\n",
-    "            tmp_labels = list(np.array(labels)[a_order][:self.k])\n",
-    "            b = tmp_labels.count(1)\n",
-    "            chosen_labels.append(b/self.k)\n",
-    "            \n",
-    "            ##########################\n",
-    "            # return the predicted class label and the following ratio: \n",
-    "            # number of points that have the same label as the test point / k\n",
-    "        return np.array(chosen_labels)\n",
-    "    "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the \"partition\" function and for the $file\\_path$ parameter of the \"preprocess\" function."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 284,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# populate the keyword arguments dictionary kwargs\n",
-    "kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}\n",
-    "# initialize the model\n",
-    "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Assign a value to $k$ and fit the kNN model. You do not need to change the value of the $threshold$ parameter yet."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 285,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "kwargs_f = {'metric': 'Euclidean'}\n",
-    "my_model.fit(k = 10, distance_f=distance, **kwargs_f)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Evaluate your model on the test data and report your accuracy. Also, calculate and report the confidence interval on the generalization error estimate."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 286,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "final_labels = my_model.predict(my_model.test_indices)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 289,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([196, 106, 193, 105])"
-      ]
-     },
-     "execution_count": 289,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# You should see array([ 196, 106,  193, 105]) with seed 123\n",
-    "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold= 0.5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now, implement a \"ROC\" function that predicts the labels of the test set examples using different $threshold$ values in \"fit\" and plot the ROC curve. \"ROC\" takes a list of different $threshold$ values to try and returns a (sensitivity, 1-specificity) pair for each $threshold$ value."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def ROC(true, pred, value_list):\n",
-    "    '''\n",
-    "    true: nx1 array of true labels for test set\n",
-    "    pred: nx1 array of predicted labels for test set\n",
-    "    Calculate sensitivity and 1-specificity for each point in value_list\n",
-    "    Return two nX1 arrays: sens (for sensitivities) and spec_ (for 1-specificities)\n",
-    "    '''\n",
-    "    \n",
-    "    \n",
-    "    return sens, spec_"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can finally create the confusion matrix and plot the ROC curve for our kNN classifier."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# confusion matrix\n",
-    "conf_matrix(true_classes, predicted_classes)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# ROC curve\n",
-    "roc_sens, roc_spec_ = ROC(true_classes, predicted_classes, np.arange(0.1, 1.0, 0.1))\n",
-    "plt.plot(roc_spec_, roc_sens)\n",
-    "plt.show()"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
-%% Cell type:markdown id: tags:
-
-# k-Nearest Neighbor
-
-%% Cell type:markdown id: tags:
-
-You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
-
-%% Cell type:code id: tags:
-
-``` python
-import numpy as np
-import matplotlib.pyplot as plt
-```
-
-%% Cell type:markdown id: tags:
-
-The following code makes the Model class and relevant functions available from model-Solution.ipynb.
-
-%% Cell type:code id: tags:
-
-``` python
-%run 'model-Solution.ipynb'
-```
-
-%% Cell type:markdown id: tags:
-
-The choice of distance metric plays an important role in the performance of kNN. Let's start by implementing the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
-
-%% Cell type:code id: tags:
-
-``` python
-def distance(x, y, metric):
-    '''
-    x: a 1xd array
-    y: a 1xd array
-    metric: Euclidean, Hamming, etc.
-    '''
-    #raise NotImplementedError
-
-    if metric == 'Euclidean':
-        dist = np.sqrt(np.sum(np.square((x-y))))
-
-    ####################################
-    return dist # scalar distance btw x and y
-```
-
-%% Cell type:markdown id: tags:
-
-We can now implement our kNN classifier. The kNN class inherits from the Model class. Implement the "fit" and "predict" methods, using the "distance" function you defined above. The "fit" method takes $k$ as an argument. The "predict" method takes as input the feature vector for a single test point and outputs the predicted class and the proportion of the predicted class label among the $k$ nearest neighbors.
-
-%% Cell type:code id: tags:
-
-``` python
-class kNN(Model):
-
-    def fit(self, k, distance_f, **kwargs):
-
-        #raise NotImplementedError
-
-        self.k = k
-        self.distance_f = distance_f
-        self.distance_metric = kwargs['metric']
-
-
-        #######################
-        return
-    # vary the threshold value for ROC analysis
-    def predict(self, test_points):
-
-        chosen_labels = []
-        for test_point in self.features[test_points]:
-            #raise NotImplementedError
-            tmp_dist = [np.inf] * self.k
-            distances = []
-
-            labels = []
-            for index in self.training_indices:
-                dist = self.distance_f(self.features[index], test_point, self.distance_metric)
-                distances.append(dist)
-                labels.append(self.labels[index])
-            a_order = np.argsort(distances)
-            tmp_labels = list(np.array(labels)[a_order][:self.k])
-            b = tmp_labels.count(1)
-            chosen_labels.append(b/self.k)
-
-            ##########################
-            # return the predicted class label and the following ratio:
-            # number of points that have the same label as the test point / k
-        return np.array(chosen_labels)
-
-```
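The neighbor search in "predict" can be illustrated on its own, outside the Model class. The following is a minimal sketch, not the notebook's implementation: the function name `knn_positive_ratio`, the toy arrays, and the use of -1/+1 labels are all assumptions made for illustration. It shows that taking the first $k$ entries of an ascending `np.argsort` yields the nearest neighbors.

```python
import numpy as np

def knn_positive_ratio(train_X, train_y, test_X, k):
    """For each test row, return the fraction of its k nearest
    training points whose label is 1 (hypothetical helper)."""
    ratios = []
    for point in test_X:
        # Euclidean distance from this test point to every training point
        dists = np.sqrt(np.sum(np.square(train_X - point), axis=1))
        # argsort is ascending, so the first k indices are the nearest neighbors
        nearest = np.argsort(dists)[:k]
        ratios.append(np.sum(train_y[nearest] == 1) / k)
    return np.array(ratios)

# Toy data (made up): three negatives near the origin, three positives near (5, 5)
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
train_y = np.array([-1, -1, -1, 1, 1, 1])
test_X = np.array([[0.05, 0.05], [4.9, 5.0]])

ratios = knn_positive_ratio(train_X, train_y, test_X, k=3)
print(ratios)
```

The returned ratios can then be thresholded to obtain class labels, which is what makes a later ROC analysis possible.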
-
-%% Cell type:markdown id: tags:
-
-It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for the $file\_path$ parameter of the "preprocess" function.
-
-%% Cell type:code id: tags:
-
-``` python
-# populate the keyword arguments dictionary kwargs
-kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}
-# initialize the model
-my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)
-```
-
-%% Cell type:markdown id: tags:
-
-Assign a value to $k$ and fit the kNN model. You do not need to change the value of the $threshold$ parameter yet.
-
-%% Cell type:code id: tags:
-
-``` python
-kwargs_f = {'metric': 'Euclidean'}
-my_model.fit(k = 10, distance_f=distance, **kwargs_f)
-```
-
-%% Cell type:markdown id: tags:
-
-Evaluate your model on the test data and report your accuracy. Also, calculate and report the confidence interval on the generalization error estimate.
-
-%% Cell type:code id: tags:
-
-``` python
-final_labels = my_model.predict(my_model.test_indices)
-```
-
-%% Cell type:markdown id: tags:
-
-Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray.
-
-%% Cell type:code id: tags:
-
-``` python
-# You should see array([ 196, 106,  193, 105]) with seed 123
-conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold= 0.5)
-```
-
-%% Output
-
-    array([196, 106, 193, 105])
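The accuracy and a confidence interval on the error can be derived from the four counts in the output above. This is a rough sketch under two assumptions that the notebook does not confirm: that the counts are ordered `[tp, tn, fp, fn]` (as in the "conf_matrix" of model_solution.ipynb), and that a normal approximation to the binomial with $z = 1.96$ is an acceptable way to form a 95% interval.

```python
import math

# Counts copied from the output above; the [tp, tn, fp, fn] order is an assumption
tp, tn, fp, fn = 196, 106, 193, 105

n = tp + tn + fp + fn            # number of test examples
accuracy = (tp + tn) / n         # fraction classified correctly
error = 1.0 - accuracy           # estimated generalization error

# 95% interval via the normal approximation: error +/- z * sqrt(error * (1 - error) / n)
z = 1.96
half_width = z * math.sqrt(error * (1.0 - error) / n)
ci_low, ci_high = error - half_width, error + half_width

print("n = {}, accuracy = {:.3f}, error = {:.3f}".format(n, accuracy, error))
print("95% CI on error: [{:.3f}, {:.3f}]".format(ci_low, ci_high))
```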
-
-%% Cell type:markdown id: tags:
-
-ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now, implement a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "fit" and plot the ROC curve. "ROC" takes a list of different $threshold$ values to try and returns a (sensitivity, 1-specificity) pair for each $threshold$ value.
-
-%% Cell type:code id: tags:
-
-``` python
-def ROC(true, pred, value_list):
-    '''
-    true: nx1 array of true labels for test set
-    pred: nx1 array of predicted labels for test set
-    Calculate sensitivity and 1-specificity for each point in value_list
-    Return two nX1 arrays: sens (for sensitivities) and spec_ (for 1-specificities)
-    '''
-
-
-    return sens, spec_
-```
-
-%% Cell type:markdown id: tags:
-
-We can finally create the confusion matrix and plot the ROC curve for our kNN classifier.
-
-%% Cell type:code id: tags:
-
-``` python
-# confusion matrix
-conf_matrix(true_classes, predicted_classes)
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# ROC curve
-roc_sens, roc_spec_ = ROC(true_classes, predicted_classes, np.arange(0.1, 1.0, 0.1))
-plt.plot(roc_spec_, roc_sens)
-plt.show()
-```
--- a/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb
+++ b/ProgrammingAssignment_1/ProgrammingAssignment1_solution.ipynb
--- a/ProgrammingAssignment_1/model_solution.ipynb
+++ b/ProgrammingAssignment_1/model_solution.ipynb
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# JUPYTER NOTEBOOK TIPS\n",
-    "\n",
-    "Each rectangular box is called a cell. \n",
-    "* Ctrl+ENTER evaluates the current cell; if it contains Python code, it runs the code, if it contains Markdown, it returns rendered text.\n",
-    "* Alt+ENTER evaluates the current cell and adds a new cell below it.\n",
-    "* If you click to the left of a cell, you'll notice the frame changes color to blue. You can erase a cell by hitting 'dd' (that's two \"d\"s in a row) when the frame is blue."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Supervised Learning Model Skeleton\n",
-    "\n",
-    "We'll use this skeleton for implementing different supervised learning algorithms."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "class Model:\n",
-    "        \n",
-    "    def fit(self):\n",
-    "        \n",
-    "        raise NotImplementedError\n",
-    "    \n",
-    "    def predict(self, test_points):\n",
-    "        raise NotImplementedError"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def preprocess(feature_file, label_file):\n",
-    "    '''\n",
-    "    Args:\n",
-    "        feature_file: str \n",
-    "            file containing features\n",
-    "        label_file: str\n",
-    "            file containing labels\n",
-    "    Returns:\n",
-    "        features: ndarray\n",
-    "            nxd features\n",
-    "        labels: ndarray\n",
-    "            nx1 labels\n",
-    "    '''\n",
-    "    \n",
-    "    # read in features and labels\n",
-    "    features = np.genfromtxt(feature_file)\n",
-    "    labels = np.genfromtxt(label_file)\n",
-    "    \n",
-    "    return features, labels"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def partition(size, t, v = 0):\n",
-    "    '''\n",
-    "    Args:\n",
-    "        size: int\n",
-    "            number of examples in the whole dataset\n",
-    "        t: float\n",
-    "            proportion kept for test\n",
-    "        v: float\n",
-    "            proportion kept for validation\n",
-    "    Returns:\n",
-    "        test_indices: ndarray\n",
-    "            1D array containing test set indices\n",
-    "        val_indices: ndarray\n",
-    "            1D array containing validation set indices\n",
-    "    '''\n",
-    "    \n",
-    "    # number of test and validation examples\n",
-    "    t_size = int(np.ceil(size*t))\n",
-    "    v_size = int(np.ceil(size*v))\n",
-    "\n",
-    "    # shuffle the indices\n",
-    "    permuted = np.random.permutation(size)\n",
-    "    \n",
-    "    # spare the first t_size for test\n",
-    "    test_indices =  permuted[:t_size]\n",
-    "    # and the next v_size for validation\n",
-    "    val_indices = permuted[t_size+1:t_size+v_size+1]\n",
-    "    train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)\n",
-    "    \n",
-    "    return test_indices, val_indices, train_indices"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## TASK 1: Implement `distance` function"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The \"distance\" function will be used by *k*-NN to measure how far apart two points are. It should take two data points and the name of the metric and return a scalar value."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#TODO: Programming Assignment 1\n",
-    "def distance(x, y, metric):\n",
-    "    '''\n",
-    "    Args:\n",
-    "        x: ndarray \n",
-    "            1D array containing coordinates for a point\n",
-    "        y: ndarray\n",
-    "            1D array containing coordinates for a point\n",
-    "        metric: str\n",
-    "            Euclidean, Manhattan \n",
-    "    Returns:\n",
-    "        dist: float\n",
-    "    '''\n",
-    "    if metric == 'Euclidean':\n",
-    "        dist = np.sqrt(np.sum(np.square((x-y))))\n",
-    "    elif metric == 'Manhattan':\n",
-    "        dist = np.sum(abs(x-y))\n",
-    "    else:\n",
-    "        raise ValueError('{} is not a valid metric.'.format(metric))\n",
-    "    return dist # scalar distance btw x and y"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## General supervised learning performance related functions "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Implement the \"conf_matrix\" function that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# TODO: Programming Assignment 1\n",
-    "\n",
-    "def conf_matrix(true, pred):\n",
-    "    '''\n",
-    "    Args:    \n",
-    "        true:  ndarray\n",
-    "            nx1 array of true labels for test set\n",
-    "        pred: ndarray \n",
-    "            nx1 array of predicted labels for test set\n",
-    "    Returns:\n",
-    "        ndarray\n",
-    "    '''\n",
-    "        \n",
-    "    tp = tn = fp = fn = 0\n",
-    "    # calculate true positives (tp), true negatives(tn)\n",
-    "    # false positives (fp) and false negatives (fn)\n",
-    "    \n",
-    "    size = len(true)\n",
-    "    for i in range(size):\n",
-    "        if true[i]==1:\n",
-    "            if pred[i] > 0:\n",
-    "               \n",
-    "                tp += 1\n",
-    "            else:\n",
-    "               \n",
-    "                fn += 1\n",
-    "        else:\n",
-    "            if pred[i] == 0:\n",
-    "               \n",
-    "                tn += 1    \n",
-    "            else:\n",
-    "               \n",
-    "                fp += 1                            \n",
-    "    \n",
-    "    # returns the confusion matrix as numpy.ndarray\n",
-    "    return np.array([tp,tn, fp, fn])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. \"ROC\" takes a list of different *threshold* values to try and returns two arrays: one whose entries are the sensitivities at each threshold and one whose entries are the corresponding 1-specificities."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# TODO: Programming Assignment 1\n",
-    "\n",
-    "def ROC(true_labels, preds, value_list):\n",
-    "    '''\n",
-    "    Args:\n",
-    "        true_labels: ndarray\n",
-    "            1D array containing true labels\n",
-    "        preds: ndarray\n",
-    "            1D array containing the values to threshold (e.g. proportion of positive neighbors in kNN)\n",
-    "        value_list: ndarray\n",
-    "            1D array containing different threshold values\n",
-    "    Returns:\n",
-    "        sens: ndarray\n",
-    "            1D array containing sensitivities\n",
-    "        spec_: ndarray\n",
-    "            1D array containing 1-specificities\n",
-    "    '''\n",
-    "    \n",
-    "    # use conf_matrix to calculate tp, tn, fp, fn\n",
-    "    # calculate sensitivity, 1-specificity\n",
-    "    # return two arrays\n",
-    "    sens = []\n",
-    "    spec_ = []\n",
-    "    for threshold in value_list:\n",
-    "        pred_labels = [1 if x >= threshold else 0 for x in preds]\n",
-    "        tp,tn, fp, fn = conf_matrix(true_labels, pred_labels)        \n",
-    "        se = tp/(tp+fn)\n",
-    "        sens.append(se)\n",
-    "        spec = tn/(tn+fp)\n",
-    "        spec_.append(1 - spec)\n",
-    "        \n",
-    "    return np.array(sens), np.array(spec_)"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.4"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
-%% Cell type:markdown id: tags:
-
-# JUPYTER NOTEBOOK TIPS
-
-Each rectangular box is called a cell.
-* Ctrl+ENTER evaluates the current cell; if it contains Python code, it runs the code, if it contains Markdown, it returns rendered text.
-* Alt+ENTER evaluates the current cell and adds a new cell below it.
-* If you click to the left of a cell, you'll notice the frame changes color to blue. You can erase a cell by hitting 'dd' (that's two "d"s in a row) when the frame is blue.
-
-%% Cell type:markdown id: tags:
-
-# Supervised Learning Model Skeleton
-
-We'll use this skeleton for implementing different supervised learning algorithms.
-
-%% Cell type:code id: tags:
-
-``` python
-class Model:
-
-    def fit(self):
-
-        raise NotImplementedError
-
-    def predict(self, test_points):
-        raise NotImplementedError
-```
-
-%% Cell type:code id: tags:
-
-``` python
-def preprocess(feature_file, label_file):
-    '''
-    Args:
-        feature_file: str
-            file containing features
-        label_file: str
-            file containing labels
-    Returns:
-        features: ndarray
-            nxd features
-        labels: ndarray
-            nx1 labels
-    '''
-
-    # read in features and labels
-    features = np.genfromtxt(feature_file)
-    labels = np.genfromtxt(label_file)
-
-    return features, labels
-```
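To see what "preprocess" returns without the dataset on disk, `np.genfromtxt` can be pointed at in-memory text instead of file paths. This is only a sketch: the `io.StringIO` stand-ins and their values are made up and do not reflect the contents of the actual feature and label files.

```python
import io
import numpy as np

# Tiny in-memory stand-ins for the feature and label files (made-up values)
feature_file = io.StringIO("1 2 3\n4 5 6\n7 8 9\n10 11 12\n")
label_file = io.StringIO("1\n-1\n-1\n1\n")

# Same calls as in preprocess: whitespace-delimited text is parsed into float arrays
features = np.genfromtxt(feature_file)   # one row per example, one column per feature
labels = np.genfromtxt(label_file)       # one label per example

print(features.shape, labels.shape)
```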
-
-%% Cell type:code id: tags:
-
-``` python
-def partition(size, t, v = 0):
-    '''
-    Args:
-        size: int
-            number of examples in the whole dataset
-        t: float
-            proportion kept for test
-        v: float
-            proportion kept for validation
-    Returns:
-        test_indices: ndarray
-            1D array containing test set indices
-        val_indices: ndarray
-            1D array containing validation set indices
-    '''
-
-    # number of test and validation examples
-    t_size = int(np.ceil(size*t))
-    v_size = int(np.ceil(size*v))
-
-    # shuffle the indices
-    permuted = np.random.permutation(size)
-
-    # spare the first t_size for test
-    test_indices =  permuted[:t_size]
-    # and the next v_size for validation
-    val_indices = permuted[t_size+1:t_size+v_size+1]
-    train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)
-
-    return test_indices, val_indices, train_indices
-```
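A quick way to sanity-check a partition is to confirm that the three index sets are disjoint and together cover every example. The sketch below is a variant written for illustration, not the function above: the name `partition_sketch`, the `seed` argument, and the use of `np.random.default_rng` are assumptions, and it slices the permutation contiguously so that training indices come back shuffled rather than sorted.

```python
import numpy as np

def partition_sketch(size, t, v=0.0, seed=None):
    """Split range(size) into shuffled test, validation and training indices
    (hypothetical variant of partition)."""
    rng = np.random.default_rng(seed)
    t_size = int(np.ceil(size * t))      # number of test examples
    v_size = int(np.ceil(size * v))      # number of validation examples
    permuted = rng.permutation(size)     # shuffled copy of 0..size-1
    test_idx = permuted[:t_size]
    val_idx = permuted[t_size:t_size + v_size]
    train_idx = permuted[t_size + v_size:]
    return test_idx, val_idx, train_idx

test_idx, val_idx, train_idx = partition_sketch(100, t=0.25, v=0.25, seed=123)

# Every index should appear in exactly one of the three sets
combined = np.concatenate([test_idx, val_idx, train_idx])
print(len(test_idx), len(val_idx), len(train_idx))
print(np.array_equal(np.sort(combined), np.arange(100)))
```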
-
-%% Cell type:markdown id: tags:
-
-## TASK 1: Implement `distance` function
-
-%% Cell type:markdown id: tags:
-
-The "distance" function will be used by *k*-NN to measure how far apart two points are. It should take two data points and the name of the metric and return a scalar value.
-
-%% Cell type:code id: tags:
-
-``` python
-#TODO: Programming Assignment 1
-def distance(x, y, metric):
-    '''
-    Args:
-        x: ndarray
-            1D array containing coordinates for a point
-        y: ndarray
-            1D array containing coordinates for a point
-        metric: str
-            Euclidean, Manhattan
-    Returns:
-        dist: float
-    '''
-    if metric == 'Euclidean':
-        dist = np.sqrt(np.sum(np.square((x-y))))
-    elif metric == 'Manhattan':
-        dist = np.sum(abs(x-y))
-    else:
-        raise ValueError('{} is not a valid metric.'.format(metric))
-    return dist # scalar distance btw x and y
-```
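One way to check a distance function is to compare it against `np.linalg.norm` on a point pair whose distances are known by hand. The sketch below assumes a hypothetical helper named `distance_sketch`; it is an equivalent formulation for checking purposes, not a replacement for the function above.

```python
import numpy as np

def distance_sketch(x, y, metric):
    """Distance between two 1D points, computed with np.linalg.norm
    (hypothetical helper for cross-checking)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if metric == 'Euclidean':
        return float(np.linalg.norm(diff, ord=2))   # square root of summed squares
    if metric == 'Manhattan':
        return float(np.linalg.norm(diff, ord=1))   # sum of absolute differences
    raise ValueError('{} is not a valid metric.'.format(metric))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_euclid = distance_sketch(a, b, 'Euclidean')   # 3-4-5 right triangle
d_manhat = distance_sketch(a, b, 'Manhattan')   # |3| + |4|
print(d_euclid, d_manhat)
```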
-
-%% Cell type:markdown id: tags:
-
-## General supervised learning performance related functions
-
-%% Cell type:markdown id: tags:
-
-Implement the "conf_matrix" function that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray.
-
-%% Cell type:code id: tags:
-
-``` python
-# TODO: Programming Assignment 1
-
-def conf_matrix(true, pred):
-    '''
-    Args:
-        true:  ndarray
-            nx1 array of true labels for test set
-        pred: ndarray
-            nx1 array of predicted labels for test set
-    Returns:
-        ndarray
-    '''
-
-    tp = tn = fp = fn = 0
-    # calculate true positives (tp), true negatives(tn)
-    # false positives (fp) and false negatives (fn)
-
-    size = len(true)
-    for i in range(size):
-        if true[i]==1:
-            if pred[i] > 0:
-
-                tp += 1
-            else:
-
-                fn += 1
-        else:
-            if pred[i] == 0:
-
-                tn += 1
-            else:
-
-                fp += 1
-
-    # returns the confusion matrix as numpy.ndarray
-    return np.array([tp,tn, fp, fn])
-```
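The same four counts can be computed without an explicit loop by combining boolean masks. This is a sketch under the assumption that predictions are already 0/1 labels and that the positive class is labeled 1; the name `conf_counts` and the toy arrays are made up for illustration.

```python
import numpy as np

def conf_counts(true, pred):
    """Return [tp, tn, fp, fn], mirroring the branch conditions of the loop above
    (hypothetical vectorized variant)."""
    true = np.asarray(true)
    pred = np.asarray(pred)
    actual_pos = (true == 1)
    tp = int(np.sum(actual_pos & (pred > 0)))      # positive, predicted positive
    fn = int(np.sum(actual_pos & ~(pred > 0)))     # positive, predicted negative
    tn = int(np.sum(~actual_pos & (pred == 0)))    # negative, predicted negative
    fp = int(np.sum(~actual_pos & (pred != 0)))    # negative, predicted positive
    return np.array([tp, tn, fp, fn])

true = np.array([1, 1, -1, -1, 1])
pred = np.array([1, 0, 0, 1, 1])
counts = conf_counts(true, pred)
print(counts)
```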
-
-%% Cell type:markdown id: tags:
-
-ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. "ROC" takes a list of different *threshold* values to try and returns two arrays: one whose entries are the sensitivities at each threshold and one whose entries are the corresponding 1-specificities.
-
-%% Cell type:code id: tags:
-
-``` python
-# TODO: Programming Assignment 1
-
-def ROC(true_labels, preds, value_list):
-    '''
-    Args:
-        true_labels: ndarray
-            1D array containing true labels
-        preds: ndarray
-            1D array containing the values to threshold (e.g. proportion of positive neighbors in kNN)
-        value_list: ndarray
-            1D array containing different threshold values
-    Returns:
-        sens: ndarray
-            1D array containing sensitivities
-        spec_: ndarray
-            1D array containing 1-specificities
-    '''
-
-    # use conf_matrix to calculate tp, tn, fp, fn
-    # calculate sensitivity, 1-specificity
-    # return two arrays
-    sens = []
-    spec_ = []
-    for threshold in value_list:
-        pred_labels = [1 if x >= threshold else 0 for x in preds]
-        tp,tn, fp, fn = conf_matrix(true_labels, pred_labels)
-        se = tp/(tp+fn)
-        sens.append(se)
-        spec = tn/(tn+fp)
-        spec_.append(1 - spec)
-
-    return np.array(sens), np.array(spec_)
-```
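A small worked example may help show how the threshold sweep behaves at its extremes. The sketch below is self-contained and does not call the functions above: the name `roc_points`, the toy labels, and the toy scores are assumptions chosen so the results can be checked by hand.

```python
import numpy as np

def roc_points(true_labels, scores, thresholds):
    """Return (sensitivity, 1-specificity) arrays, one entry per threshold
    (hypothetical helper)."""
    true_labels = np.asarray(true_labels)
    scores = np.asarray(scores)
    positives = (true_labels == 1)
    sens, spec_ = [], []
    for threshold in thresholds:
        predicted_pos = scores >= threshold
        tp = np.sum(positives & predicted_pos)
        fn = np.sum(positives & ~predicted_pos)
        fp = np.sum(~positives & predicted_pos)
        tn = np.sum(~positives & ~predicted_pos)
        sens.append(tp / (tp + fn))     # true positive rate
        spec_.append(fp / (fp + tn))    # false positive rate = 1 - specificity
    return np.array(sens), np.array(spec_)

# Toy data (made up): three positives and three negatives with kNN-style ratios
labels = np.array([1, 1, 1, -1, -1, -1])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])

sens, spec_ = roc_points(labels, scores, [0.0, 0.5, 1.0])
print(sens, spec_)
```

A threshold of 0 labels everything positive and a threshold above every score labels everything negative, so the sweep runs from the top-right corner of the ROC plot to the bottom-left. When plotting, 1-specificity conventionally goes on the horizontal axis and sensitivity on the vertical axis.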