"We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than other classes. In those cases, we can use the confusion matrix to find a good trade off between correct and wrong predictions and allow more wrong predictions in some classes to predict more examples correctly in a that class.\n",
"\n",
"Below calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
"Below calculate the accuracies for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section."
]
},
{
...
...
%% Cell type:markdown id: tags:
# $k$-Nearest Neighbor
We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* dependent variable should be binary (suited for binary classification)
* number of features (attributes) should be at least 50
* number of examples (instances) should be at least 1,000
A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.
### Assignment Goals:
In this assignment, we will:
* learn to split a dataset into training/validation/test partitions
* use the validation dataset to find a good value for $k$
* having found the "best" $k$, obtain the final performance measures: accuracy, generalization error, and the ROC curve
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions from model.ipynb available.
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance method in the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    raise NotImplementedError
    return dist  # scalar distance btw x and y
```
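%% Cell type:markdown id: tags:
For reference, here is a minimal sketch of one way "distance" could be filled in, assuming $x$ and $y$ arrive as 1-d numpy arrays and $metric$ is a string; the set of supported metrics shown is an assumption, not a requirement.
%% Cell type:code id: tags:
``` python
# Illustrative sketch only, not the required implementation.
def distance_sketch(x, y, metric):
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if metric == 'Euclidean':
        # straight-line distance between the two vectors
        return np.sqrt(np.sum((x - y) ** 2))
    elif metric == 'Hamming':
        # number of positions where the two vectors differ
        return np.sum(x != y)
    raise ValueError('Unknown metric: {}'.format(metric))
```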
%% Cell type:markdown id: tags:
### $k$-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our $k$-NN classifier. The kNN class inherits from the Model class. You'll need to implement the "fit" and "predict" methods. Use the "distance" function you defined above. The "fit" method takes $k$ as an argument. "predict" takes as input an $m \times d$ array containing $m$ $d$-dimensional feature vectors and outputs the predicted class and the ratio of positive examples among the $k$ nearest neighbors.
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''
    def fit(self, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.
        '''
        # set self.k, self.distance_f, self.distance_metric
        raise NotImplementedError
        return

    def predict(self, test_indices):
        raise NotImplementedError
        pred = []
        # For each test point, use your implementation of the distance
        # function, distance_f(..., distance_metric), to find the labels
        # of its k nearest neighbors. Find the ratio of positive labels
        # and append it to pred with pred.append(ratio).
        return np.array(pred)
```
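%% Cell type:markdown id: tags:
To make the expected behavior concrete, here is a rough sketch of how "fit" and "predict" might look. The attribute names `self.features`, `self.labels`, and `self.train_indices` are assumptions about what the Model class in model.ipynb stores; adapt them to the actual skeleton.
%% Cell type:code id: tags:
``` python
# Sketch only; self.features, self.labels and self.train_indices are
# assumed attribute names, not necessarily those used in model.ipynb.
class kNNSketch(Model):
    def fit(self, k, distance_f, **kwargs):
        self.k = k
        self.distance_f = distance_f
        self.distance_metric = kwargs.get('distance_metric', 'Euclidean')
        return

    def predict(self, test_indices):
        pred = []
        train_x = self.features[self.train_indices]
        train_y = self.labels[self.train_indices]
        for i in test_indices:
            # distance from this test point to every training point
            dists = np.array([self.distance_f(self.features[i], t, self.distance_metric)
                              for t in train_x])
            nearest = np.argsort(dists)[:self.k]
            # ratio of positive labels among the k nearest neighbors,
            # assuming positive labels are encoded as 1
            pred.append(np.mean(train_y[nearest] == 1))
        return np.array(pred)
```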
%% Cell type:markdown id: tags:
### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)
%% Cell type:markdown id: tags:
It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for $file\_path$ in the "preprocess" function.
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate.
%% Cell type:code id: tags:
``` python
# For now, we will consider a data point as predicted in the positive class
# if more than 0.5 of its k nearest neighbors are positive.
threshold = 0.5
# Calculate accuracy and generalization error with confidence interval here.
```
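%% Cell type:markdown id: tags:
As a sketch of the evaluation step: with test error $\hat{e}$ measured on $n$ test examples, an approximate 95% confidence interval on the generalization error is $\hat{e} \pm 1.96\sqrt{\hat{e}(1-\hat{e})/n}$. The names `ratios` and `test_labels` below are placeholders for your own variables.
%% Cell type:code id: tags:
``` python
# Sketch; ratios (predicted positive ratios) and test_labels (true 0/1
# labels) are placeholder names for your model's outputs.
pred_labels = (ratios > threshold).astype(int)
accuracy = np.mean(pred_labels == test_labels)
error = 1 - accuracy
n = len(test_labels)
# approximate 95% confidence interval on the generalization error
margin = 1.96 * np.sqrt(error * (1 - error) / n)
print('accuracy: {:.3f}'.format(accuracy))
print('error: {:.3f} +/- {:.3f}'.format(error, margin))
```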
%% Cell type:markdown id: tags:
### Plotting a learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for the training and validation data while varying the size of the training set. Report a training set size that achieves a good balance between bias and variance.
%% Cell type:code id: tags:
``` python
# Try training set sizes 100, 200, 300, ..., up to the largest multiple
# of 100 that is <= train_size. Plot training and validation error for
# each size.
```
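%% Cell type:markdown id: tags:
A sketch of the loop this cell could contain, assuming the fit/predict API sketched above; `train_indices`, `val_indices`, and the `error_on(model, indices)` helper (returning 1 minus accuracy on the given examples) are hypothetical names standing in for your own partition variables.
%% Cell type:code id: tags:
``` python
# Sketch; train_indices, val_indices and error_on() are hypothetical
# names, not part of the provided skeleton.
sizes = list(range(100, train_size + 1, 100))
train_err, val_err = [], []
for s in sizes:
    model.train_indices = train_indices[:s]   # first s training examples
    model.fit(k, distance_f=distance, distance_metric='Euclidean')
    train_err.append(error_on(model, train_indices[:s]))
    val_err.append(error_on(model, val_indices))
plt.plot(sizes, train_err, label='training error')
plt.plot(sizes, val_err, label='validation error')
plt.xlabel('training set size')
plt.ylabel('error')
plt.legend()
plt.show()
```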
%% Cell type:markdown id: tags:
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
%% Cell type:markdown id: tags:
We can use the validation set to come up with a $k$ value that results in better accuracy. Additionally, in some cases, predicting examples from a certain class correctly is more critical than for other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples correctly in the more critical class.

Below, calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
%% Cell type:code id: tags:
``` python
# Change values of k.
# Calculate accuracies for the validation set.
# Report a good k value that you'll use in the following analyses.
```
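%% Cell type:markdown id: tags:
One possible shape for this cell, as a sketch; `val_indices` and `val_labels` are placeholder names for your validation partition, and `conf_matrix` is the function you implement in model.ipynb.
%% Cell type:code id: tags:
``` python
# Sketch; val_indices and val_labels are placeholders, and the list of
# k values to try is illustrative.
for k in [1, 3, 5, 9, 15, 25]:
    model.fit(k, distance_f=distance, distance_metric='Euclidean')
    ratios = model.predict(val_indices)
    preds = (ratios > 0.5).astype(int)
    print('k = {}: validation accuracy = {:.3f}'.format(
        k, np.mean(preds == val_labels)))
    print(conf_matrix(val_labels, preds))
```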
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now implement, in "model.ipynb", a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict", and plot the ROC curve. "ROC" takes a list of $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and another where each entry is the corresponding 1-specificity.
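As a sketch of the bookkeeping involved, one possible "ROC" implementation is outlined below; the argument names are illustrative and the exact signature is up to your model.ipynb.
%% Cell type:code id: tags:
``` python
# Sketch; true (0/1 labels), ratios (predicted positive ratios) and the
# threshold list are illustrative names.
def ROC_sketch(true, ratios, thresholds):
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred = (ratios > t).astype(int)
        tp = np.sum((pred == 1) & (true == 1))
        fn = np.sum((pred == 0) & (true == 1))
        fp = np.sum((pred == 1) & (true == 0))
        tn = np.sum((pred == 0) & (true == 0))
        sens.append(tp / (tp + fn))            # sensitivity
        one_minus_spec.append(fp / (fp + tn))  # 1 - specificity
    return np.array(sens), np.array(one_minus_spec)
```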
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.)