minor

3d3404bb · Zeynep Hakguder · b07ec1f9 · 3d3404bb
Commit 3d3404bb authored 6 years ago by Zeynep Hakguder
--- a/ProgrammingAssignment1.ipynb
+++ b/ProgrammingAssignment1.ipynb
@@ -183,28 +183,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
   "metadata": {},
-   "outputs": [
-    {
-     "ename": "NameError",
-     "evalue": "name 'my_model' is not defined",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
-      "\u001b[0;32m<ipython-input-3-e365162558f6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfinal_labels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmy_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmy_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtest_indices\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mthreshold\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0.5\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0;31m# Calculate accuracy and generalization error with confidence interval here. For now, We will consider a data point as predicted in the positive class if more than 0.5 of its k-neighbors are positive.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;31mNameError\u001b[0m: name 'my_model' is not defined"
-     ]
-    }
-   ],
+   "outputs": [],
   "source": [
    "final_labels = my_model.predict(my_model.test_indices)\n",
    "\n",
-    "# Calculate accuracy and generalization error with confidence interval here. \n",
    "# For now, We will consider a data point as predicted in the positive class if more than 0.5 \n",
    "# of its k-neighbors are positive.\n",
-    "threshold = 0.5"
+    "threshold = 0.5\n",
+    "# Calculate accuracy and generalization error with confidence interval here."
   ]
  },
  {
@@ -214,7 +202,7 @@
    " ### Plotting a learning curve\n",
    " \n",
    "A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).\n",
-    "We'll plot the error values for training and validation data while varying the size of the training set."
+    "We'll plot the error values for training and validation data while varying the size of the training set. Report a good size for training set for which there is a good balance between bias and variance."
   ]
  },
  {
@@ -267,7 +255,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [

 %% Cell type:markdown id: tags:

 # $k$-Nearest Neighbor

 We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
 * dependent variable should be binary (suited for binary classification)
 * number of features (attributes) should be at least 50
 * number of examples (instances) should be at least 1,000

 A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.

 ### Assignment Goals:
 In this assignment, we will:
 * learn to split a dataset into training/validation/test partitions
 * use the validation dataset to find a good value for $k$
 * obtain final performance measures for the "best" $k$ we found:
    * accuracy, generalization error and ROC curve

 %% Cell type:markdown id: tags:

 You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries.

 %% Cell type:code id: tags:

 ``` python
 import numpy as np
 import matplotlib.pyplot as plt
 ```

 %% Cell type:markdown id: tags:

 The following code makes the Model class and relevant functions available from model.ipynb.

 %% Cell type:code id: tags:

 ``` python
 %run 'model.ipynb'
 ```

 %% Cell type:markdown id: tags:

 The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance measure in the "distance" function below. It should take two data points and the name of the metric, and return a scalar value.

 %% Cell type:code id: tags:

 ``` python
 def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    raise NotImplementedError

    return dist # scalar distance btw x and y
 ```
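
As an illustration of what such a function might look like, here is a minimal sketch supporting three common metrics. The name `distance_sketch` and the choice of supported metric names are assumptions made for this example and are not part of the assignment skeleton.

```python
import numpy as np

def distance_sketch(x, y, metric="Euclidean"):
    """Return a scalar distance between two feature vectors.

    Hypothetical helper: the metric names handled here are an assumption.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("x and y must have the same number of features")

    name = metric.lower()
    if name == "euclidean":
        # square root of the summed squared differences
        return float(np.sqrt(np.sum((x - y) ** 2)))
    if name == "manhattan":
        # sum of absolute differences
        return float(np.sum(np.abs(x - y)))
    if name == "hamming":
        # number of positions where the two vectors differ
        return float(np.sum(x != y))
    raise ValueError(f"unknown metric: {metric}")
```

Normalizing the metric name to lower case lets callers pass either 'Euclidean' or 'euclidean'.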

 %% Cell type:markdown id: tags:

 ### $k$-NN Class Methods

 %% Cell type:markdown id: tags:

 We can now start implementing our $k$-NN classifier. The $k$-NN class inherits from the Model class. You'll need to implement the "fit" and "predict" methods, using the "distance" function you defined above. The "fit" method takes $k$ as an argument. The "predict" method takes as input an $m \times d$ array containing $m$ feature vectors of dimension $d$ and outputs the predicted class and the proportion of predicted class labels among the $k$ nearest neighbors.

 %% Cell type:code id: tags:

 ``` python
 class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''

    def fit(self, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.
        '''
        # set self.k, self.distance_f, self.distance_metric
        raise NotImplementedError

        return


    def predict(self, test_indices):

        raise NotImplementedError

        pred = []
        # for each point in test points
        # use your implementation of distance function
        #  distance_f(..., distance_metric)
        # to find the labels of k-nearest neighbors.

        # Find the ratio of the positive labels
        # and append to pred with pred.append(ratio).


        return np.array(pred)

 ```
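
The core step described in the comments of "predict" can be sketched outside the class as a standalone function. This is only a sketch under stated assumptions: it works directly on arrays instead of the Model class (whose attributes are defined in model.ipynb and not shown here), it hard-codes Euclidean distance, and it assumes labels are coded as 0 and 1. The name `knn_positive_ratio` is hypothetical.

```python
import numpy as np

def knn_positive_ratio(train_X, train_y, query_X, k):
    """For each query row, return the fraction of its k nearest
    training points (Euclidean distance) whose label is 1."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)

    ratios = []
    for q in query_X:
        # distance from this query point to every training point
        dists = np.sqrt(np.sum((train_X - q) ** 2, axis=1))
        # indices of the k smallest distances (stable sort keeps ties in input order)
        nearest = np.argsort(dists, kind="stable")[:k]
        # proportion of positive labels among those neighbors
        ratios.append(np.mean(train_y[nearest] == 1))
    return np.array(ratios)
```

Returning the ratio instead of a hard label keeps the threshold decision separate, so the same predictions can be reused with different thresholds.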

 %% Cell type:markdown id: tags:

 ### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)

 %% Cell type:markdown id: tags:

 It's time to build and evaluate our model now. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for the $file\_path$ parameter of the "preprocess" function.

 %% Cell type:code id: tags:

 ``` python
 # populate the keyword arguments dictionary kwargs
 kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}
 # initialize the model
 my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)
 ```

 %% Cell type:markdown id: tags:

 Assign a value to $k$ and fit the kNN model.

 %% Cell type:code id: tags:

 ``` python
 kwargs_f = {'metric': 'Euclidean'}
 my_model.fit(k = 10, distance_f=distance, **kwargs_f)
 ```

 %% Cell type:markdown id: tags:

 Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate.

 %% Cell type:code id: tags:

 ``` python
 final_labels = my_model.predict(my_model.test_indices)

-# Calculate accuracy and generalization error with confidence interval here.
 # For now, We will consider a data point as predicted in the positive class if more than 0.5
 # of its k-neighbors are positive.
 threshold = 0.5
+# Calculate accuracy and generalization error with confidence interval here.
 ```
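
One possible way to carry out this calculation is sketched below. It assumes labels are coded as 0 and 1, that the predictions are positive-neighbor ratios, and that a normal approximation to the binomial distribution is acceptable for the interval (z = 1.96 for roughly 95% confidence). The function name `accuracy_and_error_ci` is hypothetical.

```python
import numpy as np

def accuracy_and_error_ci(true_labels, pos_ratio, threshold=0.5, z=1.96):
    """Return accuracy, error, and an approximate confidence interval on the error.

    Assumes 0/1 labels; the interval uses the normal approximation
    error +/- z * sqrt(error * (1 - error) / n).
    """
    true_labels = np.asarray(true_labels)
    # a point is predicted positive if more than `threshold` of its neighbors are positive
    predicted = (np.asarray(pos_ratio) > threshold).astype(int)

    n = true_labels.size
    accuracy = float(np.mean(predicted == true_labels))
    error = 1.0 - accuracy
    half_width = z * np.sqrt(error * (1.0 - error) / n)
    return accuracy, error, (error - half_width, error + half_width)
```

The normal approximation is least reliable when the test set is small or the error is close to 0 or 1.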

-%% Output
-
-    ---------------------------------------------------------------------------
-    NameError                                 Traceback (most recent call last)
-    <ipython-input-3-e365162558f6> in <module>()
-    ----> 1 final_labels = my_model.predict(my_model.test_indices)
-          2 threshold = 0.5
-          3 # Calculate accuracy and generalization error with confidence interval here. For now, We will consider a data point as predicted in the positive class if more than 0.5 of its k-neighbors are positive.
-    NameError: name 'my_model' is not defined
-
 %% Cell type:markdown id: tags:

 ### Plotting a learning curve

 A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
-We'll plot the error values for training and validation data while varying the size of the training set.
+We'll plot the error values for training and validation data while varying the size of the training set. Report a good size for training set for which there is a good balance between bias and variance.

 %% Cell type:code id: tags:

 ``` python
 training_sizes = np.arange(0, my_model.train_size + 1, 100)

 # Calculate error for each entry in training_sizes
 # for training and validation sets and populate
 # error_train and error_val arrays. Each entry in these arrays
 # should correspond to each entry in training_sizes.

 plt.plot(training_sizes, error_train, 'r', label = 'training_error')
 plt.plot(training_sizes, error_val, 'g', label = 'validation_error')
 plt.legend()
 plt.show()
 ```

 %% Cell type:markdown id: tags:

 ### Computing the confusion matrix for $k = 10$
 Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.

 %% Cell type:code id: tags:

 ``` python
 # You should see array([ 196, 106,  193, 105]) with seed 123
 conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold= 0.5)
 ```
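
A minimal sketch of such a function follows. The ordering of the four counts (true positives, false positives, false negatives, true negatives) is an assumption made for this example, as is the 0/1 label coding; the ordering expected by the assignment is not stated here. The name `conf_matrix_sketch` is hypothetical.

```python
import numpy as np

def conf_matrix_sketch(true, pred, threshold=0.5):
    """Count TP, FP, FN, TN for 0/1 labels.

    `pred` holds positive-neighbor ratios; the output order
    [TP, FP, FN, TN] is an assumption of this sketch.
    """
    true = np.asarray(true)
    pred_labels = (np.asarray(pred) > threshold).astype(int)

    tp = int(np.sum((pred_labels == 1) & (true == 1)))
    fp = int(np.sum((pred_labels == 1) & (true == 0)))
    fn = int(np.sum((pred_labels == 0) & (true == 1)))
    tn = int(np.sum((pred_labels == 0) & (true == 0)))
    return np.array([tp, fp, fn, tn])
```

The four counts always sum to the number of examples, which is a quick sanity check on any implementation.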

 %% Cell type:markdown id: tags:

 ### Finding a good value for $k$

 We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than for other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples of the critical class correctly.

 Below, calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.

 %% Cell type:code id: tags:

 ``` python
 # Change values of k.
 # Calculate accuracies for the validation set.
 # Report a good k value that you'll use in the following analyses.
 ```

 %% Cell type:markdown id: tags:

 ### ROC curve and confusion matrix for the final model
 ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now, implement a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict" and plot the ROC curve. "ROC" takes a list containing the different $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold and another where each entry is the corresponding 1-specificity.

 %% Cell type:code id: tags:

 ``` python
 def ROC(model, indices, value_list):
    '''
    model: a fitted k-NN model
    indices: for data points to predict
    value_list: array containing different threshold values
    Calculate sensitivity and 1-specificity for each point in value_list
    Return two nX1 arrays: sens (for sensitivities) and spec_ (for 1-specificities)
    '''

    # use predict_batch to obtain predicted labels at different threshold values
    raise NotImplementedError

    return sens, spec_
 ```
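
The per-threshold computation can be sketched on plain arrays as follows. This is a sketch under stated assumptions: it takes true labels and positive-neighbor ratios directly instead of a fitted model and indices, it assumes 0/1 labels with both classes present, and the name `roc_points` is hypothetical.

```python
import numpy as np

def roc_points(true, pos_ratio, thresholds):
    """Return (sens, spec_) arrays, one entry per threshold.

    sens  = TP / (TP + FN)   (sensitivity, true positive rate)
    spec_ = FP / (FP + TN)   (1 - specificity, false positive rate)
    Assumes 0/1 labels and at least one example of each class.
    """
    true = np.asarray(true)
    pos_ratio = np.asarray(pos_ratio, dtype=float)
    n_pos = np.sum(true == 1)
    n_neg = np.sum(true == 0)

    sens, spec_ = [], []
    for t in thresholds:
        predicted = pos_ratio > t  # predicted positive at this threshold
        tp = np.sum(predicted & (true == 1))
        fp = np.sum(predicted & (true == 0))
        sens.append(tp / n_pos)
        spec_.append(fp / n_neg)
    return np.array(sens), np.array(spec_)
```

When plotting, 1-specificity conventionally goes on the horizontal axis and sensitivity on the vertical axis.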

 %% Cell type:markdown id: tags:

 We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier.

 %% Cell type:code id: tags:

 ``` python
 # confusion matrix
 conf_matrix(true_classes, predicted_classes)
 ```

 %% Cell type:code id: tags:

 ``` python
 # ROC curve
 roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))
 plt.plot(roc_spec_, roc_sens)  # x: 1-specificity, y: sensitivity
 plt.show()
 ```