We'll implement the *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* dependent variable should be binary (suited for binary classification)
* number of features (attributes) should be at least 50
* number of examples (instances) should be between 1,000 and 5,000
A skeleton of a general supervised learning model is provided in "model.ipynb". The functions you need to implement there are indicated in this notebook.
### Assignment Goals:
In this assignment, we will:
* implement 'Euclidean' and 'Manhattan' distance metrics
* use the validation dataset to find a good value for *k*
* evaluate our model with respect to performance measures:
    * accuracy, generalization error and ROC curve
* try to assess whether *k*-NN is suitable for the dataset you used
%% Cell type:markdown id: tags:
You will be graded on parts that are marked with **\#TODO** comments. Read the comments in the code to make sure you don't miss any.
### Mandatory for 478 & 878:
| | Tasks | 478 | 878 |
|---|---|---|---|
| 1 | Implement `distance` | 10 | 10 |
| 2 | Implement `k-NN` methods | 25 | 20 |
| 3 | Model evaluation | 25 | 20 |
| 4 | Learning curve | 20 | 20 |
| 6 | ROC curve analysis | 20 | 20 |
### Mandatory for 878, bonus for 478
| | Tasks | 478 | 878 |
|---|---|---|---|
| 5 | Optimizing *k* | 10 | 10 |
### Bonus for 478/878
| | Tasks | 478 | 878 |
|---|---|---|---|
| 7 | Assess suitability of *k*-NN | 10 | 10 |
Points are broken down further below in the Rubric sections. The **first** score is for 478 students, the **second** is for 878 students. There are a total of 100 points in this assignment, plus 20 bonus points for 478 students and 10 bonus points for 878 students.
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions available from model.ipynb.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## TASK 1: Implement `distance` function
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of *k*-NN. Let's start by implementing the "distance" function in **model.ipynb**. It should take two data points and the name of the metric, and return a scalar value.
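As a rough illustration of the two metrics, here is a minimal sketch. The signature (`x`, `y`, `metric`) and the exact metric strings are assumptions for this example; the skeleton in model.ipynb may define them differently.

```python
import numpy as np

def distance(x, y, metric="Euclidean"):
    """Return the distance between two points (hypothetical signature)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = x - y
    if metric == "Euclidean":
        # square root of the sum of squared coordinate differences
        return float(np.sqrt(np.sum(diff ** 2)))
    if metric == "Manhattan":
        # sum of absolute coordinate differences
        return float(np.sum(np.abs(diff)))
    raise ValueError(f"Unknown metric: {metric}")

print(distance([0, 0], [3, 4], "Euclidean"))  # 5.0
print(distance([0, 0], [3, 4], "Manhattan"))  # 7.0
```

Both metrics return a single scalar, as the task requires; raising an error on an unknown metric name is a design choice of this sketch.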
## TASK 2: Implement `k-NN` methods
We can now start implementing our *k*-NN classifier. The *k*-NN class inherits from the Model class. Use the "distance" function you defined above. The "fit" method takes *k* as an argument. The "predict" method takes as input an *m* x *d* array containing *m* feature vectors of dimension *d*, and outputs the predicted class and the ratio of positive examples among the *k* nearest neighbors.
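The prediction step can be sketched as a standalone function. This is not the class interface from model.ipynb: the name `knn_predict`, its arguments, the 0/1 label encoding, and the fixed Euclidean metric are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k, threshold=0.5):
    """Hypothetical helper: return (labels, ratios) for each query row.

    Assumes labels are encoded as 0 (negative) and 1 (positive).
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise Euclidean distances, shape (m, n), via broadcasting.
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points for each query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Fraction of positive labels among those k neighbors.
    ratios = y_train[nearest].mean(axis=1)
    labels = (ratios > threshold).astype(int)
    return labels, ratios

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
X_query = np.array([[0, 0.5], [5.5, 5]])
labels, ratios = knn_predict(X_train, y_train, X_query, k=3)
print(labels, ratios)
```

Note that the broadcasted difference array has shape (m, n, d), which can be large for a dataset with hundreds of features; processing the query rows in batches is one way to keep memory use manageable.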
%% Cell type:markdown id: tags:
### Rubric:
* correct implementation of fit method +5, +5
* correct implementation of predict method +20, +15
%% Cell type:code id: tags:
``` python
# Inherits Model class. Implements the k-NN algorithm for classification.
```
%% Cell type:markdown id: tags:
## TASK 3: Model evaluation
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
%% Cell type:code id: tags:
``` python
# For now, we will consider a data point as predicted in the positive class if more than 0.5
# of its k-neighbors are positive.
# Convert predicted ratios to predicted labels.
# Obtain true positive, true negative,
# false positive and false negative counts using conf_matrix.
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the 95% confidence interval on the generalization **error** estimate.
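The calculation could look roughly like the sketch below. It assumes 0/1 labels, a `[[TP, FP], [FN, TN]]` matrix layout (the layout expected by model.ipynb may differ), and the usual normal-approximation interval, error ± 1.96·sqrt(error·(1 − error)/n), which is only reasonable for a sufficiently large test set. The helper `error_with_ci` is hypothetical.

```python
import numpy as np

def conf_matrix(true, pred):
    """Return [[TP, FP], [FN, TN]] for 0/1 labels (layout is an assumption)."""
    true = np.asarray(true)
    pred = np.asarray(pred)
    tp = int(np.sum((true == 1) & (pred == 1)))
    tn = int(np.sum((true == 0) & (pred == 0)))
    fp = int(np.sum((true == 0) & (pred == 1)))
    fn = int(np.sum((true == 1) & (pred == 0)))
    return np.array([[tp, fp], [fn, tn]])

def error_with_ci(cm, z=1.96):
    """Hypothetical helper: accuracy, error and a normal-approximation CI."""
    (tp, fp), (fn, tn) = cm
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    error = (fp + fn) / n
    half_width = z * np.sqrt(error * (1 - error) / n)
    # Clip the interval to the valid range [0, 1].
    ci = (max(0.0, error - half_width), min(1.0, error + half_width))
    return accuracy, error, ci

true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
cm = conf_matrix(true, pred)
accuracy, error, ci = error_with_ci(cm)
print(cm, accuracy, error, ci)
```

With only ten examples, as in this toy input, the interval is very wide; on a test set of several hundred examples it narrows considerably.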
%% Cell type:code id: tags:
``` python
# Calculate and report accuracy and generalization error with confidence interval here. Show your work in this cell.
```
%% Cell type:markdown id: tags:
## TASK 4: Learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for the training and validation data while varying the size of the training set. Report a training set size that gives a good balance between bias and variance.
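The loop over training set sizes might be organized as in the following sketch. The function `learning_curve`, the stand-in 1-NN predictor `one_nn_predict`, and the synthetic data are assumptions for illustration; in the assignment you would call your own *k*-NN model instead.

```python
import numpy as np

def one_nn_predict(X_tr, y_tr, X_query):
    """Stand-in predictor: label of the single nearest neighbor (Manhattan)."""
    d = np.abs(X_query[:, None, :] - X_tr[None, :, :]).sum(axis=2)
    return y_tr[np.argmin(d, axis=1)]

def learning_curve(predict_fn, X_train, y_train, X_val, y_val, fractions):
    """Hypothetical helper: errors for growing prefixes of the training set."""
    training_sizes, error_train, error_val = [], [], []
    for frac in fractions:
        size = max(1, int(round(frac * len(X_train))))
        X_sub, y_sub = X_train[:size], y_train[:size]
        # Error on the data the model was fit on.
        error_train.append(np.mean(predict_fn(X_sub, y_sub, X_sub) != y_sub))
        # Error on held-out validation data.
        error_val.append(np.mean(predict_fn(X_sub, y_sub, X_val) != y_val))
        training_sizes.append(size)
    return np.array(training_sizes), np.array(error_train), np.array(error_val)

# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

fractions = np.arange(1, 11) / 10  # 10%, 20%, ..., 100%
sizes, err_tr, err_val = learning_curve(
    one_nn_predict, X_train, y_train, X_val, y_val, fractions
)
print(sizes, err_tr, err_val)
```

Plotting `sizes` against both error arrays with matplotlib then gives the learning curve. With *k* = 1 the training error is zero, because each training point is its own nearest neighbor; larger *k* values give a more informative training curve.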
%% Cell type:markdown id: tags:
### Rubric:
* Correct training error calculation for different training set sizes +8, +8
* Correct validation error calculation for different training set sizes +8, +8
* Reasonable learning curve +4, +4
%% Cell type:code id: tags:
``` python
# Train using 10%, 20%, 30%, ..., 100% of the training data.
# For each size in training_sizes:
#   fit the model using "size" data points,
#   calculate the error for the training and validation sets,
#   and populate the error_train and error_val arrays.
# Each entry in these arrays should correspond to
# the matching entry in training_sizes.
```
%% Cell type:markdown id: tags:
## TASK 5: Optimizing *k*
We can use the validation set to find a *k* value that gives better accuracy.
Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Report the confusion matrix for the new value of *k*. Hint: Try values both smaller and larger than 10.
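One way to compare several *k* values without recomputing distances is sketched below. The helper `accuracy_per_k`, the 0/1 labels, the Euclidean metric, the candidate *k* values, and the synthetic data are assumptions for this example.

```python
import numpy as np

def accuracy_per_k(X_train, y_train, X_val, y_val, k_values):
    """Hypothetical helper: validation accuracy for each candidate k."""
    diffs = X_val[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Sort neighbors once and reuse the ordering for every k.
    order = np.argsort(dists, axis=1)
    accuracies = {}
    for k in k_values:
        ratios = y_train[order[:, :k]].mean(axis=1)
        pred = (ratios > 0.5).astype(int)
        accuracies[k] = float(np.mean(pred == y_val))
    return accuracies

# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

acc = accuracy_per_k(X[:200], y[:200], X[200:], y[200:], [1, 3, 5, 11, 21])
best_k = max(acc, key=acc.get)
print(acc, best_k)
```

Odd values of *k* avoid exact ties at a ratio of 0.5 in binary classification, which is why the candidate list here uses them.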
%% Cell type:code id: tags:
``` python
# Change values of k.
# Calculate accuracies for the validation set.
# Report a good k value.
# Calculate the confusion matrix for new k.
```
%% Cell type:markdown id: tags:
## TASK 6: ROC curve analysis
### Rubric:
* Correct implementation +20, +20
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now implement, in "model.ipynb", a "ROC" function that predicts the labels of the test set examples using different *threshold* values in "predict" and plots the ROC curve. "ROC" takes a list of *threshold* parameter values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and one where each entry is the corresponding 1-specificity.
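The computation behind the two returned arrays can be sketched as follows. Unlike the "ROC" function described above, this hypothetical `roc_points` helper takes precomputed neighbor ratios instead of calling "predict", assumes 0/1 labels, and assumes both classes occur in the test set so that neither denominator is zero.

```python
import numpy as np

def roc_points(true, ratios, thresholds):
    """Hypothetical helper: sensitivity and 1-specificity per threshold."""
    true = np.asarray(true)
    ratios = np.asarray(ratios, dtype=float)
    sensitivity, one_minus_specificity = [], []
    for t in thresholds:
        pred = ratios > t  # positive if more than t of the neighbors are positive
        tp = np.sum(pred & (true == 1))
        fn = np.sum(~pred & (true == 1))
        fp = np.sum(pred & (true == 0))
        tn = np.sum(~pred & (true == 0))
        sensitivity.append(tp / (tp + fn))
        one_minus_specificity.append(fp / (fp + tn))
    return np.array(sensitivity), np.array(one_minus_specificity)

true = [1, 1, 1, 0, 0, 0]
ratios = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
sens, fpr = roc_points(true, ratios, [0.0, 0.5, 1.0])
print(sens, fpr)
```

Plotting `fpr` on the x-axis against `sens` on the y-axis with matplotlib gives the ROC curve; lower thresholds move the point toward the top-right corner.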
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above if you completed TASK 5; otherwise use *k* = 10. We'll plot the ROC curve for threshold values between 0.1 and 1.0.
## TASK 7: Assess suitability of *k*-NN to your dataset
%% Cell type:markdown id: tags:
Use this cell to write about your understanding of why *k*-NN performed well if it did or why not if it didn't. What properties of the dataset affect the performance of the algorithm?