minor

Commit 763d501b authored 6 years ago by Zeynep Hakguder
--- a/ProgrammingAssignment1.ipynb
+++ b/ProgrammingAssignment1.ipynb
@@ -250,7 +250,7 @@
    "\n",
    "We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than other classes. In those cases, we can use the confusion matrix to find a good trade off between correct and wrong predictions and allow more wrong predictions in some classes to predict more examples correctly in a that class.\n",
    "\n",
-    "Below calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
+    "Below calculate the accuracies for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # $k$-Nearest Neighbor

 We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose a different dataset, it should meet the following criteria:
 * dependent variable should be binary (suited for binary classification)
 * number of features (attributes) should be at least 50
 * number of examples (instances) should be at least 1,000

 A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.

 ### Assignment Goals:
 In this assignment, we will:
 * learn to split a dataset into training/validation/test partitions
 * use the validation dataset to find a good value for $k$
 * obtain final performance measures for the "best" $k$:
    * accuracy, generalization error and ROC curve
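
As a rough illustration of the first goal, the sketch below shuffles example indices and splits them three ways. The helper name `partition_indices` is hypothetical, and it assumes that `p` is the fraction held out for testing and `v` the fraction held out for validation; the actual "partition" method in "model.ipynb" may define these differently.

```python
import numpy as np

def partition_indices(n, p, v, seed=None):
    """Hypothetical helper: split indices 0..n-1 into train/validation/test.

    Assumption: p is the test fraction and v is the validation fraction;
    whatever remains is used for training.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)          # random order of all indices
    n_test = int(round(p * n))
    n_val = int(round(v * n))
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = partition_indices(1000, p=0.3, v=0.1, seed=123)
```

Fixing the seed makes the split reproducible, which helps when comparing different values of $k$ on the same validation set.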

 %% Cell type:markdown id: tags:

 You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries.

 %% Cell type:code id: tags:

 ``` python
 import numpy as np
 import matplotlib.pyplot as plt
 ```

 %% Cell type:markdown id: tags:

 The following code makes the Model class and relevant functions available from model.ipynb.

 %% Cell type:code id: tags:

 ``` python
 %run 'model.ipynb'
 ```

 %% Cell type:markdown id: tags:

 The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance measure in the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
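
One possible shape for such a function is sketched here. It is not the reference solution: the name `distance_sketch` is made up for illustration, and it only handles the two metrics named in the skeleton's docstring, with Hamming distance taken as the count of differing positions.

```python
import numpy as np

def distance_sketch(x, y, metric="Euclidean"):
    """Illustrative distance between two feature vectors (not the graded solution)."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if metric == "Euclidean":
        diff = x.astype(float) - y.astype(float)
        return float(np.sqrt(np.sum(diff ** 2)))
    if metric == "Hamming":
        # assumption: Hamming distance = number of positions that differ
        return float(np.sum(x != y))
    raise ValueError(f"unknown metric: {metric}")

d_euclid = distance_sketch([0, 0], [3, 4], "Euclidean")
d_hamming = distance_sketch([1, 0, 1], [1, 1, 0], "Hamming")
```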

 %% Cell type:code id: tags:

 ``` python
 def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    raise NotImplementedError

    return dist # scalar distance btw x and y
 ```

 %% Cell type:markdown id: tags:

 ### $k$-NN Class Methods

 %% Cell type:markdown id: tags:

 We can now start implementing our $k$-NN classifier. The $k$-NN class inherits from the Model class. You'll need to implement the "fit" and "predict" methods. Use the "distance" function you defined above. The "fit" method takes $k$ as an argument. "predict" takes as input an $m \times d$ array containing $m$ $d$-dimensional feature vectors for examples and outputs the predicted class and the ratio of positive examples among the $k$ nearest neighbors.
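
The core of "predict" can be sketched independently of the Model class. The function below is a hypothetical stand-alone helper, not the class method itself: it assumes Euclidean distance and that the positive class is labeled `1` (as in Madelon's +1/-1 encoding), and it returns only the ratio of positive neighbors for each query point.

```python
import numpy as np

def knn_positive_ratio(train_X, train_y, query_X, k):
    """Hypothetical helper: fraction of positive labels among the k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    ratios = []
    for q in query_X:
        # Euclidean distance from q to every training point
        dists = np.sqrt(np.sum((train_X - q) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]    # indices of the k closest points
        ratios.append(np.mean(train_y[nearest] == 1))
    return np.array(ratios)

ratios = knn_positive_ratio([[0], [1], [2], [10], [11]],
                            [1, 1, -1, -1, -1],
                            [[0.5], [10.5]], k=3)
```

Returning ratios rather than hard labels leaves the choice of threshold to the caller, which is what the later ROC analysis relies on.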

 %% Cell type:code id: tags:

 ``` python
 class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''

    def fit(self, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.
        '''
        # set self.k, self.distance_f, self.distance_metric
        raise NotImplementedError

        return


    def predict(self, test_indices):

        raise NotImplementedError

        pred = []
        # for each point in test points
        # use your implementation of distance function
        #  distance_f(..., distance_metric)
        # to find the labels of k-nearest neighbors.

        # Find the ratio of the positive labels
        # and append to pred with pred.append(ratio).


        return np.array(pred)

 ```

 %% Cell type:markdown id: tags:

 ### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)

 %% Cell type:markdown id: tags:

 It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for $file\_path$ of the "preprocess" function.

 %% Cell type:code id: tags:

 ``` python
 # populate the keyword arguments dictionary kwargs
 kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}
 # initialize the model
 my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)
 ```

 %% Cell type:markdown id: tags:

 Assign a value to $k$ and fit the kNN model.

 %% Cell type:code id: tags:

 ``` python
 kwargs_f = {'metric': 'Euclidean'}
 my_model.fit(k = 10, distance_f=distance, **kwargs_f)
 ```

 %% Cell type:markdown id: tags:

 Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate.
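
A common way to attach a confidence interval to an error estimate is the normal approximation to the binomial, $e \pm z\sqrt{e(1-e)/n}$, with $z = 1.96$ for a 95% interval. The sketch below applies it; the helper name is hypothetical, and it assumes labels are encoded as +1/-1 and that predictions arrive as positive-neighbor ratios.

```python
import numpy as np

def accuracy_and_error_ci(true_labels, ratios, threshold=0.5, z=1.96):
    """Hypothetical helper: accuracy, error, and a normal-approximation CI on the error."""
    true_labels = np.asarray(true_labels)
    # assumption: positive class is +1, negative class is -1
    predicted = np.where(np.asarray(ratios) > threshold, 1, -1)
    n = len(true_labels)
    accuracy = float(np.mean(predicted == true_labels))
    error = 1.0 - accuracy
    half_width = z * np.sqrt(error * (1.0 - error) / n)
    return accuracy, error, (error - half_width, error + half_width)

acc, err, ci = accuracy_and_error_ci([1, 1, -1, -1], [0.9, 0.4, 0.2, 0.1])
```

The approximation is only reasonable when the test set is fairly large; with a handful of examples, as in the toy call above, the interval is too wide to be informative.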

 %% Cell type:code id: tags:

 ``` python
 final_labels = my_model.predict(my_model.test_indices)

 # For now, we will consider a data point as predicted in the positive class if more than 0.5
 # of its k nearest neighbors are positive.
 threshold = 0.5
 # Calculate accuracy and generalization error with confidence interval here.
 ```

 %% Cell type:markdown id: tags:

 ### Plotting a learning curve

 A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
 We'll plot the error values for training and validation data while varying the size of the training set. Report a training set size that gives a good balance between bias and variance.

 %% Cell type:code id: tags:

 ``` python
 # try sizes 0, 100, 200, 300, ..., up to the largest multiple of 100 <= train_size
 training_sizes = np.arange(0, my_model.train_size + 1, 100)

 # Calculate error for each entry in training_sizes
 # for training and validation sets and populate
 # error_train and error_val arrays. Each entry in these arrays
 # should correspond to each entry in training_sizes.

 plt.plot(training_sizes, error_train, 'r', label = 'training_error')
 plt.plot(training_sizes, error_val, 'g', label = 'validation_error')
 plt.legend()
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 ### Computing the confusion matrix for $k = 10$
 Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
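
For orientation, a 2x2 confusion matrix can be computed along these lines. This is only a sketch under assumptions the assignment does not state: the function name is hypothetical, the positive class is taken to be `1`, predictions are positive-neighbor ratios, and the layout is `[[TP, FN], [FP, TN]]` with rows for true classes and columns for predicted classes.

```python
import numpy as np

def conf_matrix_sketch(true, pred, threshold=0.5):
    """Hypothetical sketch: rows are true classes, columns are predicted classes."""
    true_pos = np.asarray(true) == 1           # assumption: positive class is 1
    pred_pos = np.asarray(pred) > threshold    # ratios above threshold -> positive
    tp = np.sum(true_pos & pred_pos)
    fn = np.sum(true_pos & ~pred_pos)
    fp = np.sum(~true_pos & pred_pos)
    tn = np.sum(~true_pos & ~pred_pos)
    return np.array([[tp, fn], [fp, tn]])

cm = conf_matrix_sketch([1, 1, -1, -1], [0.9, 0.4, 0.6, 0.1])
```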

 %% Cell type:code id: tags:

 ``` python
 conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)
 ```

 %% Cell type:markdown id: tags:

 ### Finding a good value for $k$

 We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than for other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples correctly in the critical class.

-Below calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
+Below calculate the accuracies for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
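
The search over $k$ can be organized as a simple loop, sketched here on plain arrays rather than on the Model class. Both function names are hypothetical, and the sketch assumes Euclidean distance, +1/-1 labels, and a fixed threshold of 0.5.

```python
import numpy as np

def positive_ratio(train_X, train_y, query_X, k):
    """Fraction of positive (+1) labels among the k nearest training points."""
    ratios = []
    for q in query_X:
        dists = np.sqrt(np.sum((train_X - q) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]
        ratios.append(np.mean(train_y[nearest] == 1))
    return np.array(ratios)

def validation_accuracies(train_X, train_y, val_X, val_y, k_values, threshold=0.5):
    """Hypothetical helper: map each candidate k to its validation accuracy."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    val_X = np.asarray(val_X, dtype=float)
    val_y = np.asarray(val_y)
    accuracies = {}
    for k in k_values:
        ratios = positive_ratio(train_X, train_y, val_X, k)
        predicted = np.where(ratios > threshold, 1, -1)
        accuracies[k] = float(np.mean(predicted == val_y))
    return accuracies

accs = validation_accuracies([[0], [1], [2], [10], [11], [12]],
                             [1, 1, 1, -1, -1, -1],
                             [[0.5], [11.5]], [1, -1], k_values=[1, 3])
best_k = max(accs, key=accs.get)   # ties resolve to the first k tried
```

Odd values of $k$ avoid exact ties at a threshold of 0.5 in binary classification, which is one reason they are often preferred.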

 %% Cell type:code id: tags:

 ``` python
 # Change values of k.
 # Calculate accuracies for the validation set.
 # Report a good k value that you'll use in the following analyses.
 ```

 %% Cell type:markdown id: tags:

 ### ROC curve and confusion matrix for the final model
 ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now implement, in "model.ipynb", a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict" and plots the ROC curve. "ROC" takes a list of $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and the other where each entry is the corresponding 1-specificity.
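
The per-threshold computation behind such a function might look like the sketch below. It works on plain arrays instead of the model object, so its name and signature are hypothetical and differ from the "ROC" function the assignment asks for. It assumes +1/-1 labels, positive-neighbor ratios as scores, and that both classes appear in the data.

```python
import numpy as np

def roc_points(true, ratios, thresholds):
    """Hypothetical helper: sensitivity and 1-specificity at each threshold."""
    true_pos = np.asarray(true) == 1       # assumption: positive class is +1
    ratios = np.asarray(ratios)
    n_pos = np.sum(true_pos)
    n_neg = np.sum(~true_pos)
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred_pos = ratios > t
        sens.append(np.sum(true_pos & pred_pos) / n_pos)            # TP / (TP + FN)
        one_minus_spec.append(np.sum(~true_pos & pred_pos) / n_neg)  # FP / (FP + TN)
    return np.array(sens), np.array(one_minus_spec)

sens, fpr = roc_points([1, 1, -1, -1], [0.9, 0.4, 0.6, 0.1], [0.0, 0.5, 1.0])
```

By convention the curve is drawn with 1-specificity on the x-axis and sensitivity on the y-axis, so lower thresholds move a point toward the upper right.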

 %% Cell type:markdown id: tags:

 We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.)

 %% Cell type:code id: tags:

 ``` python
 # confusion matrix
 conf_matrix(true_classes, predicted_classes)
 ```

 %% Cell type:code id: tags:

 ``` python
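 # ROC curve: 1-specificity on the x-axis, sensitivity on the y-axis
 roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))
 plt.plot(roc_spec_, roc_sens)
 plt.xlabel('1 - specificity')
 plt.ylabel('sensitivity')
 plt.show()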
 ```