"We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than other classes. In those cases, we can use the confusion matrix to find a good trade off between correct and wrong predictions and allow more wrong predictions in some classes to predict more examples correctly in a that class.\n",
"\n",
"Below calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
"Below calculate the accuracies for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
]
},
{
...
...
%% Cell type:markdown id: tags:
# $k$-Nearest Neighbor
We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* dependent variable should be binary (suited for binary classification)
* number of features (attributes) should be at least 50
* number of examples (instances) should be at least 1,000
A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.
### Assignment Goals:
In this assignment, we will:
* learn to split a dataset into training/validation/test partitions
* use the validation dataset to find a good value for $k$
* having found the "best" $k$, obtain the final performance measures: accuracy, generalization error, and the ROC curve
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions from model.ipynb available.
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance method in the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    raise NotImplementedError
    return dist  # scalar distance btw x and y
```
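%% Cell type:markdown id: tags:
For reference, here is a minimal sketch of one way "distance" could be filled in, assuming $x$ and $y$ arrive as 1-d numpy arrays and $metric$ is a string; the set of supported metrics shown is an assumption, not a requirement.
%% Cell type:code id: tags:
``` python
# Illustrative sketch only, not the required implementation.
def distance_sketch(x, y, metric):
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if metric == 'Euclidean':
        # straight-line distance between the two vectors
        return np.sqrt(np.sum((x - y) ** 2))
    elif metric == 'Hamming':
        # number of positions where the two vectors differ
        return np.sum(x != y)
    raise ValueError('Unknown metric: {}'.format(metric))
```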
%% Cell type:markdown id: tags:
### $k$-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our $k$-NN classifier. The kNN class inherits from the Model class. You'll need to implement the "fit" and "predict" methods. Use the "distance" function you defined above. The "fit" method takes $k$ as an argument. "predict" takes as input an $m \times d$ array containing $m$ $d$-dimensional feature vectors and outputs the predicted class and the ratio of positive examples among the $k$ nearest neighbors.
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''
    def fit(self, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.
        '''
        # set self.k, self.distance_f, self.distance_metric
        raise NotImplementedError
        return

    def predict(self, test_indices):
        raise NotImplementedError
        pred = []
        # For each test point, use your implementation of the distance
        # function, distance_f(..., distance_metric), to find the labels
        # of its k nearest neighbors. Find the ratio of positive labels
        # and append it to pred with pred.append(ratio).
        return np.array(pred)
```
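%% Cell type:markdown id: tags:
To make the expected behavior concrete, here is a rough sketch of how "fit" and "predict" might look. The attribute names `self.features`, `self.labels`, and `self.train_indices` are assumptions about what the Model class in model.ipynb stores; adapt them to the actual skeleton.
%% Cell type:code id: tags:
``` python
# Sketch only; self.features, self.labels and self.train_indices are
# assumed attribute names, not necessarily those used in model.ipynb.
class kNNSketch(Model):
    def fit(self, k, distance_f, **kwargs):
        self.k = k
        self.distance_f = distance_f
        self.distance_metric = kwargs.get('distance_metric', 'Euclidean')
        return

    def predict(self, test_indices):
        pred = []
        train_x = self.features[self.train_indices]
        train_y = self.labels[self.train_indices]
        for i in test_indices:
            # distance from this test point to every training point
            dists = np.array([self.distance_f(self.features[i], t, self.distance_metric)
                              for t in train_x])
            nearest = np.argsort(dists)[:self.k]
            # ratio of positive labels among the k nearest neighbors,
            # assuming positive labels are encoded as 1
            pred.append(np.mean(train_y[nearest] == 1))
        return np.array(pred)
```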
%% Cell type:markdown id: tags:
### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)
%% Cell type:markdown id: tags:
It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for $file\_path$ in the "preprocess" function.
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate.
%% Cell type:code id: tags:
``` python
# For now, we will consider a data point as predicted in the positive class
# if more than 0.5 of its k nearest neighbors are positive.
threshold = 0.5
# Calculate accuracy and generalization error with confidence interval here.
```
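%% Cell type:markdown id: tags:
As a sketch of the evaluation step: with test error $\hat{e}$ measured on $n$ test examples, an approximate 95% confidence interval on the generalization error is $\hat{e} \pm 1.96\sqrt{\hat{e}(1-\hat{e})/n}$. The names `ratios` and `test_labels` below are placeholders for your own variables.
%% Cell type:code id: tags:
``` python
# Sketch; ratios (predicted positive ratios) and test_labels (true 0/1
# labels) are placeholder names for your model's outputs.
pred_labels = (ratios > threshold).astype(int)
accuracy = np.mean(pred_labels == test_labels)
error = 1 - accuracy
n = len(test_labels)
# approximate 95% confidence interval on the generalization error
margin = 1.96 * np.sqrt(error * (1 - error) / n)
print('accuracy: {:.3f}'.format(accuracy))
print('error: {:.3f} +/- {:.3f}'.format(error, margin))
```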
%% Cell type:markdown id: tags:
### Plotting a learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for the training and validation data while varying the size of the training set. Report a training set size that achieves a good balance between bias and variance.
%% Cell type:code id: tags:
``` python
# Try training set sizes 100, 200, 300, ..., up to the largest multiple
# of 100 that is <= train_size. Plot training and validation error for
# each size.
```
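%% Cell type:markdown id: tags:
A sketch of the loop this cell could contain, assuming the fit/predict API sketched above; `train_indices`, `val_indices`, and the `error_on(model, indices)` helper (returning 1 minus accuracy on the given examples) are hypothetical names standing in for your own partition variables.
%% Cell type:code id: tags:
``` python
# Sketch; train_indices, val_indices and error_on() are hypothetical
# names, not part of the provided skeleton.
sizes = list(range(100, train_size + 1, 100))
train_err, val_err = [], []
for s in sizes:
    model.train_indices = train_indices[:s]   # first s training examples
    model.fit(k, distance_f=distance, distance_metric='Euclidean')
    train_err.append(error_on(model, train_indices[:s]))
    val_err.append(error_on(model, val_indices))
plt.plot(sizes, train_err, label='training error')
plt.plot(sizes, val_err, label='validation error')
plt.xlabel('training set size')
plt.ylabel('error')
plt.legend()
plt.show()
```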
%% Cell type:markdown id: tags:
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
%% Cell type:markdown id: tags:
We can use the validation set to come up with a $k$ value that results in better accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than for other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples correctly in the more critical class.

Below, calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
%% Cell type:code id: tags:
``` python
# Change values of k.
# Calculate accuracies for the validation set.
# Report a good k value that you'll use in the following analyses.
```
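%% Cell type:markdown id: tags:
One possible shape for this cell, as a sketch; `val_indices` and `val_labels` are placeholder names for your validation partition, and `conf_matrix` is the function you implement in model.ipynb.
%% Cell type:code id: tags:
``` python
# Sketch; val_indices and val_labels are placeholders, and the list of
# k values to try is illustrative.
for k in [1, 3, 5, 9, 15, 25]:
    model.fit(k, distance_f=distance, distance_metric='Euclidean')
    ratios = model.predict(val_indices)
    preds = (ratios > 0.5).astype(int)
    print('k = {}: validation accuracy = {:.3f}'.format(
        k, np.mean(preds == val_labels)))
    print(conf_matrix(val_labels, preds))
```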
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now implement, in "model.ipynb", a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict", and plot the ROC curve. "ROC" takes a list of $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and another where each entry is the corresponding 1-specificity.
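As a sketch of the bookkeeping involved, one possible "ROC" implementation is outlined below; the argument names are illustrative and the exact signature is up to your model.ipynb.
%% Cell type:code id: tags:
``` python
# Sketch; true (0/1 labels), ratios (predicted positive ratios) and the
# threshold list are illustrative names.
def ROC_sketch(true, ratios, thresholds):
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred = (ratios > t).astype(int)
        tp = np.sum((pred == 1) & (true == 1))
        fn = np.sum((pred == 0) & (true == 1))
        fp = np.sum((pred == 1) & (true == 0))
        tn = np.sum((pred == 0) & (true == 0))
        sens.append(tp / (tp + fn))            # sensitivity
        one_minus_spec.append(fp / (fp + tn))  # 1 - specificity
    return np.array(sens), np.array(one_minus_spec)
```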
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.)