Commit 090ff480 authored by Zeynep Hakguder

minor

parent 32d3b81e
%% Cell type:markdown id: tags:
# *k*-Nearest Neighbor
We'll implement the *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* the dependent variable should be binary (suited for binary classification)
* the number of features (attributes) should be at least 50
* the number of examples (instances) should be between 1,000 and 5,000
A skeleton of a general supervised learning model is provided in "model.ipynb". This notebook indicates which functions you will implement there.
### Assignment Goals:
In this assignment, we will:
* implement 'Euclidean' and 'Manhattan' distance metrics
* use the validation dataset to find a good value for *k*
* evaluate our model with respect to performance measures:
    * accuracy, generalization error and ROC curve
* try to assess if *k*-NN is suitable for the dataset you used
%% Cell type:markdown id: tags:
# GRADING
You will be graded on parts that are marked with **\#TODO** comments. Read the comments in the code to make sure you don't miss any.
### Mandatory for 478 & 878:
| | Tasks | 478 | 878 |
|---|----------------------------|-----|-----|
| 1 | Implement `distance` | 10 | 10 |
| 2 | Implement `k-NN` methods | 25 | 20 |
| 3 | Model evaluation | 25 | 20 |
| 4 | Learning curve | 20 | 20 |
| 6 | ROC curve analysis | 20 | 20 |
### Mandatory for 878, bonus for 478
| | Tasks | 478 | 878 |
|---|----------------|-----|-----|
| 5 | Optimizing *k* | 10 | 10 |
### Bonus for 478/878
| | Tasks | 478 | 878 |
|---|----------------|-----|-----|
| 7 | Assess suitability of *k*-NN | 10 | 10 |
Points are broken down further below in the Rubric sections. The **first** score is for 478 students, the **second** for 878 students. There are 100 points total in this assignment, plus 20 possible bonus points for 478 students and 10 for 878 students.
%% Cell type:markdown id: tags:
For this assignment, you can use numpy for array operations and matplotlib for plotting. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions from model.ipynb available.
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
## TASK 1: Implement `distance` function
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of *k*-NN. Let's start by implementing the "distance" function in **model.ipynb**. It should take two data points and the name of the metric, and return a scalar value.
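%% Cell type:markdown id: tags:
For reference, a minimal sketch of what such a function could look like (assuming the two points arrive as equal-length numpy arrays; your implementation in model.ipynb may differ):
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    Sketch of a distance function between two points.
    Args:
        x, y: ndarray, two points of equal dimension
        metric: str, 'Euclidean' or 'Manhattan'
    Returns:
        float
    '''
    if metric == 'Euclidean':
        # square root of the sum of squared coordinate differences
        return np.sqrt(np.sum((x - y) ** 2))
    elif metric == 'Manhattan':
        # sum of absolute coordinate differences
        return np.sum(np.abs(x - y))
    raise ValueError('Unknown metric: {}'.format(metric))
```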
%% Cell type:markdown id: tags:
### Rubric:
* Euclidean +5, +5
* Manhattan +5, +5
%% Cell type:markdown id: tags:
### Test `distance`
%% Cell type:code id: tags:
``` python
x = np.array(range(100))
y = np.array(range(100, 200))
dist_euclidean = distance(x, y, 'Euclidean')
dist_manhattan = distance(x, y, 'Manhattan')
print('Euclidean distance: {}, Manhattan distance: {}'.format(dist_euclidean, dist_manhattan))
```
%% Cell type:markdown id: tags:
## TASK 2: Implement *k*-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our *k*-NN classifier. The *k*-NN class inherits the Model class. Use the "distance" function you defined above. The "fit" method takes *k* as an argument. The "predict" method takes an *m* x *d* array of *m* *d*-dimensional feature vectors and outputs, for each example, the ratio of positive examples among its *k* nearest neighbors; thresholding this ratio gives the predicted class.
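%% Cell type:markdown id: tags:
The heart of "predict" is a nearest-neighbor lookup. A minimal sketch of that step on toy data (np.argsort returns indices in ascending order of distance, so the first *k* entries index the nearest neighbors):
%% Cell type:code id: tags:
``` python
# toy example: distances from one test point to four training examples
distances = np.array([3.0, 1.0, 2.0, 0.5])
train_labels = np.array([1, 0, 1, 1])
k = 2
# indices of the k smallest distances, i.e. the k nearest neighbors
nearest = np.argsort(distances)[:k]
# ratio of positive labels among those neighbors
ratio = np.mean(train_labels[nearest] == 1)
print(ratio)  # the two nearest examples have labels 1 and 0 -> 0.5
```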
%% Cell type:markdown id: tags:
### Rubric:
* correct implementation of fit method +5, +5
* correct implementation of predict method +20, +15
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''

    def fit(self, training_features, training_labels, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.

        Args:
            training_features: ndarray
            training_labels: ndarray
            k: int
            distance_f: function
            kwargs: dict
                Contains keyword arguments that will be passed to distance_f
        '''
        # TODO
        # set self.train_features, self.train_labels, self.k,
        # self.distance_f, self.distance_metric
        raise NotImplementedError
        return

    def predict(self, test_features):
        '''
        Args:
            test_features: ndarray
                mxd array containing features for the points to be predicted
        Returns:
            ndarray
        '''
        raise NotImplementedError

        pred = []
        # TODO
        # For each point in test_features, use your distance function,
        # distance_f(..., distance_metric), to find the labels of its
        # k nearest neighbors. Find the ratio of positive labels and
        # append it to pred with pred.append(ratio).
        # When calculating the learning curve you can make use of
        # self.learning_curve and self.training_proportion.
        return np.array(pred)
```
%% Cell type:markdown id: tags:
## TASK 3: Build and Evaluate the Model
%% Cell type:markdown id: tags:
### Rubric:
* Reasonable accuracy values +10, +5
* Reasonable confidence intervals on the error estimate +10, +10
* Reasonable confusion matrix +5, +5
%% Cell type:markdown id: tags:
Preprocess the data files and partition the data.
%% Cell type:code id: tags:
``` python
# initialize the model
my_model = kNN()
# obtain features and labels from files
features, labels = preprocess(feature_file=..., label_file=...)
# partition the data set
val_indices, test_indices, train_indices = partition(size=..., t = 0.3, v = 0.1)
```
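%% Cell type:markdown id: tags:
If "partition" is not already provided for you in model.ipynb, here is a minimal sketch of one way to implement it, assuming *t* and *v* are the fractions of the data reserved for testing and validation:
%% Cell type:code id: tags:
``` python
def partition(size, t, v):
    '''
    Sketch: shuffle the indices 0..size-1 and split them into
    validation, test and training portions of sizes v*size, t*size
    and the remainder, respectively.
    '''
    idx = np.random.permutation(size)
    n_val = int(np.ceil(size * v))
    n_test = int(np.ceil(size * t))
    val_indices = idx[:n_val]
    test_indices = idx[n_val:n_val + n_test]
    train_indices = idx[n_val + n_test:]
    return val_indices, test_indices, train_indices
```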
%% Cell type:markdown id: tags:
Assign a value to *k* and fit the *k*-NN model.
%% Cell type:code id: tags:
``` python
# pass the training features and labels to the fit method
kwargs_f = {'metric': 'Euclidean'}
my_model.fit(training_features=..., training_labels=..., k=10, distance_f=..., **kwargs_f)
```
%% Cell type:markdown id: tags:
### Computing the confusion matrix for *k* = 10
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
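%% Cell type:markdown id: tags:
A minimal sketch of what "conf_matrix" could compute, assuming the positive class is labeled 1 and the negative class 0 (adjust the comparisons if your dataset encodes the negative class differently, e.g. as -1):
%% Cell type:code id: tags:
``` python
def conf_matrix(true, pred):
    '''
    Sketch: count true positives, true negatives, false positives
    and false negatives, returned as a numpy.ndarray.
    '''
    true = np.asarray(true)
    pred = np.asarray(pred)
    tp = np.sum((true == 1) & (pred == 1))
    tn = np.sum((true == 0) & (pred == 0))
    fp = np.sum((true == 0) & (pred == 1))
    fn = np.sum((true == 1) & (pred == 0))
    return np.array([tp, tn, fp, fn])
```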
%% Cell type:code id: tags:
``` python
# TODO
# get model predictions
pred_ratios = my_model.predict(features[test_indices])
# For now, we will consider a data point as predicted in the positive class
# if at least 0.5 of its k neighbors are positive.
threshold = 0.5
# convert predicted ratios to predicted labels
pred_labels = None
# obtain true positive, true negative,
# false positive and false negative counts using conf_matrix
tp, tn, fp, fn = conf_matrix(...)
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate.
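%% Cell type:markdown id: tags:
As a reminder, the usual normal-approximation interval for an error estimate computed on $n$ test examples is

$$\mathrm{error} \pm z \sqrt{\frac{\mathrm{error}\,(1 - \mathrm{error})}{n}},$$

with $z \approx 1.96$ for a 95% confidence level.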
%% Cell type:code id: tags:
``` python
# TODO
# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell.
print('Accuracy: {}'.format(accuracy))
print('Confidence interval: {}-{}'.format(lower_bound, upper_bound))
```
%% Cell type:markdown id: tags:
## TASK 4: Plotting a learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for training and validation data while varying the size of the training set. Report a training set size for which there is a good balance between bias and variance.
%% Cell type:markdown id: tags:
### Rubric:
* Correct training error calculation for different training set sizes +8, +8
* Correct validation error calculation for different training set sizes +8, +8
* Reasonable learning curve +4, +4
%% Cell type:code id: tags:
``` python
# train using 10%, 20%, 30%, ..., 100% of the training data
training_proportions = np.arange(0.10, 1.01, 0.10)
train_size = len(train_indices)
training_sizes = np.ceil(train_size * training_proportions).astype(int)
# TODO
error_train = []
error_val = []
# For each size in training_sizes:
for size in training_sizes:
    # fit the model using "size" data points,
    # calculate the error for the training and validation sets, and
    # populate error_train and error_val. Each entry in these arrays
    # should correspond to an entry in training_sizes.
    pass
plt.plot(training_sizes, error_train, 'r', label='training_error')
plt.plot(training_sizes, error_val, 'g', label='validation_error')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
## TASK 5: Determining *k*
%% Cell type:markdown id: tags:
### Rubric:
* Increased accuracy with new *k* +5, +5
* Improved confusion matrix +5, +5
%% Cell type:markdown id: tags:
We can use the validation set to come up with a *k* value that gives better accuracy.
Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value, use it in the analyses that follow this section, and report the confusion matrix for the new value of *k*.
%% Cell type:code id: tags:
``` python
# TODO
# Change values of k.
# Calculate accuracies for the validation set.
# Report a good k value.
# Calculate the confusion matrix for new k.
```
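%% Cell type:markdown id: tags:
One possible shape for this search, as a sketch (reusing the "fit", "predict" and "conf_matrix" pieces from above with the same 0.5 threshold; the candidate list is arbitrary):
%% Cell type:code id: tags:
``` python
# sketch of a simple search over k using the validation set
candidate_ks = [1, 3, 5, 10, 25, 50]
val_accuracies = []
for k in candidate_ks:
    my_model.fit(features[train_indices], labels[train_indices],
                 k=k, distance_f=distance, **kwargs_f)
    ratios = my_model.predict(features[val_indices])
    preds = [1 if r >= threshold else 0 for r in ratios]
    tp, tn, fp, fn = conf_matrix(labels[val_indices], preds)
    val_accuracies.append((tp + tn) / len(val_indices))
print(dict(zip(candidate_ks, val_accuracies)))
```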
%% Cell type:markdown id: tags:
## TASK 6: ROC curve analysis
### Rubric:
* Correct implementation +20, +20
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1 - specificity for varying cut-off points. Now implement, in "model.ipynb", a "ROC" function that labels the test set examples using different *threshold* values on the ratios returned by "predict", so we can plot the ROC curve. "ROC" takes a list of *threshold* values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and one where each entry is the corresponding 1 - specificity.
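%% Cell type:markdown id: tags:
A minimal sketch of the logic, assuming "ROC" receives the true labels, the ratios returned by "predict" and the list of thresholds (matching how it is called below):
%% Cell type:code id: tags:
``` python
def ROC(true_labels, pred_ratios, thresholds):
    '''
    Sketch: for each threshold, call a point positive when its
    predicted ratio meets the threshold, then record sensitivity
    (true positive rate) and 1 - specificity (false positive rate).
    '''
    sens, one_minus_spec = [], []
    for threshold in thresholds:
        preds = [1 if r >= threshold else 0 for r in pred_ratios]
        tp, tn, fp, fn = conf_matrix(true_labels, preds)
        sens.append(tp / (tp + fn))
        one_minus_spec.append(fp / (fp + tn))
    return np.array(sens), np.array(one_minus_spec)
```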
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0.
%% Cell type:code id: tags:
``` python
# TODO
# ROC curve
roc_sens, roc_spec_ = ROC(labels[test_indices], pred_ratios, np.arange(0.1, 1.0, 0.1))
# plot 1 - specificity on x and sensitivity on y, per ROC convention
plt.plot(roc_spec_, roc_sens)
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()
```
%% Cell type:markdown id: tags:
## TASK 7: Assess suitability of *k*-NN to your dataset
%% Cell type:markdown id: tags:
Use this cell to discuss why *k*-NN performed well if it did, or why it didn't if it didn't. What properties of the dataset affect the performance of the algorithm?
......
%% Cell type:markdown id: tags:
# *k*-Nearest Neighbor
We'll implement the *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* the dependent variable should be binary (suited for binary classification)
* the number of features (attributes) should be at least 50
* the number of examples (instances) should be between 1,000 and 5,000
A skeleton of a general supervised learning model is provided in "model.ipynb". This notebook indicates which functions you will implement there.
### Assignment Goals:
In this assignment, we will:
* implement 'Euclidean' and 'Manhattan' distance metrics
* use the validation dataset to find a good value for *k*
* evaluate our model with respect to performance measures:
    * accuracy, generalization error and ROC curve
* try to assess if *k*-NN is suitable for the dataset you used
%% Cell type:markdown id: tags:
# GRADING
You will be graded on parts that are marked with **\#TODO** comments. Read the comments in the code to make sure you don't miss any.
### Mandatory for 478 & 878:
| | Tasks | 478 | 878 |
|---|----------------------------|-----|-----|
| 1 | Implement `distance` | 10 | 10 |
| 2 | Implement `k-NN` methods | 25 | 20 |
| 3 | Model evaluation | 25 | 20 |
| 4 | Learning curve | 20 | 20 |
| 6 | ROC curve analysis | 20 | 20 |
### Mandatory for 878, bonus for 478
| | Tasks | 478 | 878 |
|---|----------------|-----|-----|
| 5 | Optimizing *k* | 10 | 10 |
### Bonus for 478/878
| | Tasks | 478 | 878 |
|---|----------------|-----|-----|
| 7 | Assess suitability of *k*-NN | 10 | 10 |
Points are broken down further below in the Rubric sections. The **first** score is for 478 students, the **second** for 878 students. There are 100 points total in this assignment, plus 20 possible bonus points for 478 students and 10 for 878 students.
%% Cell type:markdown id: tags:
For this assignment, you can use numpy for array operations and matplotlib for plotting. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions from model_solution.ipynb available.
%% Cell type:code id: tags:
``` python
%run 'model_solution.ipynb'
```
%% Cell type:markdown id: tags:
## TASK 1: Implement `distance` function
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of *k*-NN. Let's start by implementing the "distance" function in **model.ipynb**. It should take two data points and the name of the metric, and return a scalar value.
%% Cell type:markdown id: tags:
### Rubric:
* Euclidean +5, +5
* Manhattan +5, +5
%% Cell type:markdown id: tags:
### Test `distance`
%% Cell type:code id: tags:
``` python
x = np.array(range(100))
y = np.array(range(100, 200))
dist_euclidean = distance(x, y, 'Euclidean')
dist_manhattan = distance(x, y, 'Manhattan')
print('Euclidean distance: {}, Manhattan distance: {}'.format(dist_euclidean, dist_manhattan))
```
%% Output
Euclidean distance: 1000.0, Manhattan distance: 10000
%% Cell type:markdown id: tags:
## TASK 2: Implement *k*-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our *k*-NN classifier. The *k*-NN class inherits the Model class. Use the "distance" function you defined above. The "fit" method takes *k* as an argument. The "predict" method takes an *m* x *d* array of *m* *d*-dimensional feature vectors and outputs, for each example, the ratio of positive examples among its *k* nearest neighbors; thresholding this ratio gives the predicted class.
%% Cell type:markdown id: tags:
### Rubric:
* correct implementation of fit method +5, +5
* correct implementation of predict method +20, +15
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''

    def fit(self, training_features, training_labels, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.

        Args:
            training_features: ndarray
            training_labels: ndarray
            k: int
            distance_f: function
            kwargs: dict
                Contains keyword arguments that will be passed to distance_f
        '''
        # TODO
        # set self.train_features, self.train_labels, self.k,
        # self.distance_f, self.distance_metric
        self.train_features = training_features
        self.train_labels = training_labels
        self.k = k
        self.distance_f = distance_f
        self.distance_metric = kwargs['metric']
        return

    def predict(self, test_features):
        test_size = len(test_features)
        train_size = len(self.train_labels)
        pred = np.empty(test_size)
        # TODO
        # for each test point
        for idx in range(test_size):
            point = test_features[idx]
            distances = []
            labels = []
            # compute the distance to every training example
            for tr_idx in range(train_size):
                train_example = self.train_features[tr_idx]
                train_label = self.train_labels[tr_idx]
                dist = self.distance_f(point, train_example, metric=self.distance_metric)
                distances.append(dist)
                labels.append(train_label)
            # argsort orders indices by ascending distance
            dist_order = np.argsort(distances)
            # get the labels of the k points closest to the test point
            k_labels = list(np.array(labels)[dist_order[:self.k]])
            # ratio of positive labels among the k neighbors
            b = k_labels.count(1)
            pred[idx] = b / self.k
        return pred
```
%% Cell type:markdown id: tags:
## TASK 3: Build and Evaluate the Model
%% Cell type:markdown id: tags:
### Rubric:
* Reasonable accuracy values +10, +5
* Reasonable confidence intervals on the error estimate +10, +10
* Reasonable confusion matrix +5, +5
%% Cell type:markdown id: tags:
Preprocess the data files and partition the data.
%% Cell type:code id: tags:
``` python
# initialize the model
my_model = kNN()
# obtain features and labels from files
features, labels = preprocess('../data/madelon.data', '../data/madelon.labels')
# partition the data set
val_indices, test_indices, train_indices = partition(features.shape[0], t = 0.3, v = 0.1)
```
%% Cell type:markdown id: tags:
Assign a value to *k* and fit the *k*-NN model.
%% Cell type:code id: tags:
``` python
# pass the training features and labels to the fit method
kwargs_f = {'metric': 'Euclidean'}
my_model.fit(features[train_indices], labels[train_indices], k=10, distance_f=distance, **kwargs_f)
```
%% Cell type:markdown id: tags:
### Computing the confusion matrix for *k* = 10
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
%% Cell type:code id: tags:
``` python
# TODO
# get model predictions
pred_ratios = my_model.predict(features[test_indices])
# For now, we will consider a data point as predicted in the positive class
# if at least 0.5 of its k neighbors are positive.
threshold = 0.5
# convert predicted ratios to predicted labels
pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]
tp, tn, fp, fn = conf_matrix(labels[test_indices], pred_labels)
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate.
%% Cell type:code id: tags:
``` python
# TODO
# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell.
accuracy = (tp + tn) / len(test_indices)
error = 1 - accuracy
# 95% confidence interval: error +/- 1.96 * sqrt(error * (1 - error) / n)
diff = 1.96 * np.sqrt((error * (1 - error)) / len(test_indices))
lower_bound = error - diff
upper_bound = error + diff
print('Accuracy: {}'.format(accuracy))
print('Confidence interval: {}-{}'.format(lower_bound, upper_bound))
```
%% Output
Accuracy: 0.475
Confidence interval: 0.49110132745961876-0.5588986725403813
%% Cell type:markdown id: tags:
## TASK 4: Plotting a learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for training and validation data while varying the size of the training set. Report a training set size for which there is a good balance between bias and variance.
%% Cell type:markdown id: tags:
### Rubric:
* Correct training error calculation for different training set sizes +8, +8
* Correct validation error calculation for different training set sizes +8, +8
* Reasonable learning curve +4, +4
%% Cell type:code id: tags:
``` python
# train using 10%, 20%, 30%, ..., 100% of the training data
training_proportions = np.arange(0.10, 1.01, 0.10)
# TODO
# Calculate the training and validation error for each entry in
# training_sizes and populate error_train and error_val. Each entry
# in these arrays should correspond to an entry in training_sizes.
error_train = []
error_val = []
training_sizes = []
train_size = len(train_indices)
for proportion in training_proportions:
    size_avail = int(np.ceil(train_size * proportion))
    training_sizes.append(size_avail)
    # use only the first size_avail training examples
    idx_avail = train_indices[:size_avail]
    kwargs_f = {'metric': 'Euclidean'}
    my_model.fit(features[idx_avail], labels[idx_avail], k=10, distance_f=distance, **kwargs_f)
    # validation error
    val_pred_ratios = my_model.predict(features[val_indices])
    val_pred_labels = [1 if x >= threshold else 0 for x in val_pred_ratios]
    tp, tn, fp, fn = conf_matrix(labels[val_indices], val_pred_labels)
    val_accuracy = (tp + tn) / len(val_indices)
    error_val.append(1 - val_accuracy)
    # training error on the examples the model was trained on
    train_pred_ratios = my_model.predict(features[idx_avail])
    train_pred_labels = [1 if x >= threshold else 0 for x in train_pred_ratios]
    tp, tn, fp, fn = conf_matrix(labels[idx_avail], train_pred_labels)
    train_accuracy = (tp + tn) / size_avail
    error_train.append(1 - train_accuracy)
plt.plot(training_sizes, error_train, 'r', label='training_error')
plt.plot(training_sizes, error_val, 'g', label='validation_error')
plt.legend()
plt.show()
print('{} is a good training size'.format(600))
```
%% Output
600 is a good training size
%% Cell type:markdown id: tags:
## TASK 5: Determining *k*
%% Cell type:markdown id: tags:
### Rubric:
* Increased accuracy with new *k* +5, +5
* Improved confusion matrix +5, +5
%% Cell type:markdown id: tags:
We can use the validation set to come up with a *k* value that gives better accuracy.
Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Hint: try values both smaller and larger than 10.
%% Cell type:code id: tags:
``` python
# TODO
k_accuracies = []
# Try different values of k and calculate accuracies on the validation set.
for k in [1, 5, 10, 50, 100, 150, 200]:
    my_model.fit(features[train_indices], labels[train_indices], k=k, distance_f=distance, **kwargs_f)
    pred_ratios = my_model.predict(features[val_indices])
    pred_labels = [1 if x >= threshold else 0 for x in pred_ratios]
    tp, tn, fp, fn = conf_matrix(labels[val_indices], pred_labels)
    accuracy = (tp + tn) / len(val_indices)
    k_accuracies.append(accuracy)
# Report a good k value.
k_accuracies
```
%% Output
[0.49666666666666665, 0.55, 0.545, 0.505, 0.495, 0.465, 0.445]
%% Cell type:markdown id: tags:
## TASK 6: ROC curve analysis
### Rubric:
* ROC curve has correct shape +20, +20
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1 - specificity for varying cut-off points. Now implement, in "model.ipynb", a "ROC" function that labels the test set examples using different *threshold* values on the ratios returned by "predict", so we can plot the ROC curve. "ROC" takes a list of *threshold* values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and one where each entry is the corresponding 1 - specificity.
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above, if you completed TASK 5, else use *k* = 10. We'll plot the ROC curve for values between 0.1 and 1.0.
%% Cell type:code id: tags:
``` python
# TODO
# ROC curve
my_model.fit(features[train_indices], labels[train_indices], k=100, distance_f=distance, **kwargs_f)
pred_ratios = my_model.predict(features[test_indices])
roc_sens, roc_spec_ = ROC(labels[test_indices], pred_ratios, np.arange(0.1, 1.0, 0.1))
# plot 1 - specificity on x and sensitivity on y, per ROC convention
plt.plot(roc_spec_, roc_sens)
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
## TASK 7: Assess suitability of *k*-NN to your dataset
%% Cell type:markdown id: tags:
Use this cell to discuss why *k*-NN performed well if it did, or why it didn't if it didn't. What properties of the dataset affect the performance of the algorithm?
......