Commit 763d501b authored by Zeynep Hakguder

minor

parent 3bb81126
%% Cell type:markdown id: tags:
# $k$-Nearest Neighbor
We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:
* dependent variable should be binary (suited for binary classification)
* number of features (attributes) should be at least 50
* number of examples (instances) should be at least 1,000
A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.
### Assignment Goals:
In this assignment, we will:
* learn to split a dataset into training/validation/test partitions
* use the validation dataset to find a good value for $k$
* having found the "best" $k$, obtain final performance measures: accuracy, generalization error and ROC curve
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting for this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions available from model.ipynb.
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance measure in the "distance" function below. It should take two data points and the name of the metric, and return a scalar value.
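As a rough illustration of the expected behavior, here is a minimal sketch of such a function. The name `distance_sketch`, the inclusion of a Manhattan metric, and the choice to report Hamming distance as a count of differing positions (rather than a fraction) are assumptions made for this example, not requirements of the assignment.

```python
import numpy as np

def distance_sketch(x, y, metric="Euclidean"):
    """Return a scalar distance between two equal-length vectors (illustrative only)."""
    # Flatten so that both 1xd and (d,) inputs are accepted.
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if metric == "Euclidean":
        return float(np.sqrt(np.sum((x - y) ** 2)))
    if metric == "Manhattan":
        return float(np.sum(np.abs(x - y)))
    if metric == "Hamming":
        # Assumption: Hamming distance is the number of differing positions.
        return float(np.sum(x != y))
    raise ValueError(f"unknown metric: {metric}")
```

For example, `distance_sketch([0, 0], [3, 4], "Euclidean")` evaluates to `5.0`.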
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    raise NotImplementedError
    return dist # scalar distance btw x and y
```
%% Cell type:markdown id: tags:
### $k$-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our $k$-NN classifier. The kNN class inherits from the Model class. You'll need to implement the "fit" and "predict" methods, using the "distance" function you defined above. The "fit" method takes $k$ as an argument. "predict" takes as input an $m \times d$ array containing $m$ feature vectors of dimension $d$ and outputs the predicted class and the ratio of positive examples among the $k$ nearest neighbors.
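The core of the prediction step can be sketched independently of the Model class. The standalone function below is only an illustration: its name and signature are hypothetical, it hard-codes Euclidean distance, and it assumes the positive class is labeled `1` (as in Madelon, where labels are +1 and -1).

```python
import numpy as np

def knn_positive_ratio(train_X, train_y, query_X, k):
    """For each query row, return the fraction of positive labels
    among its k nearest training rows (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    ratios = []
    for q in query_X:
        # Distance from this query point to every training point.
        dists = np.sqrt(np.sum((train_X - q) ** 2, axis=1))
        # Indices of the k smallest distances (stable sort keeps ties in order).
        nearest = np.argsort(dists, kind="stable")[:k]
        # Assumption: the positive class is encoded as 1.
        ratios.append(np.mean(train_y[nearest] == 1))
    return np.array(ratios)
```

Returning the ratio rather than a hard label leaves the decision threshold open, which is what later makes it possible to trace out an ROC curve.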
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits Model class. Implements the k-NN algorithm for classification.
    '''
    def fit(self, k, distance_f, **kwargs):
        '''
        Fit the model. This is pretty straightforward for k-NN.
        '''
        # set self.k, self.distance_f, self.distance_metric
        raise NotImplementedError
        return

    def predict(self, test_indices):
        raise NotImplementedError
        pred = []
        # for each point in test points
        # use your implementation of distance function
        # distance_f(..., distance_metric)
        # to find the labels of k-nearest neighbors.
        # Find the ratio of the positive labels
        # and append to pred with pred.append(ratio).
        return np.array(pred)
```
%% Cell type:markdown id: tags:
### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)
%% Cell type:markdown id: tags:
It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for $file\_path$ of the "preprocess" function.
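The exact meaning of $p$ and $v$ is defined by the skeleton in model.ipynb, which is not shown here. As a sketch under the assumption that $p$ is the fraction of examples held out for testing and $v$ the fraction held out for validation, a random partition could look like the following (the name `partition_sketch` and its return order are hypothetical):

```python
import numpy as np

def partition_sketch(size, p, v, seed=None):
    """Split indices 0..size-1 into train, validation and test index arrays.

    Assumption: p is the test fraction and v is the validation fraction.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(size)
    n_test = int(round(p * size))
    n_val = int(round(v * size))
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx
```

Fixing the seed makes the split reproducible, so that results for different values of $k$ are compared on the same validation examples.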
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}
# initialize the model
my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)
```
%% Cell type:markdown id: tags:
Assign a value to $k$ and fit the kNN model.
%% Cell type:code id: tags:
``` python
kwargs_f = {'metric': 'Euclidean'}
my_model.fit(k = 10, distance_f=distance, **kwargs_f)
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your **accuracy**. Also, calculate and report the confidence interval on the generalization **error** estimate.
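One common way to obtain such an interval is the normal approximation to the binomial: with observed error $e$ on $n$ test examples, the approximate 95% interval is $e \pm 1.96\sqrt{e(1-e)/n}$. The sketch below assumes this approximation, assumes the positive class is labeled `1`, and uses a hypothetical function name.

```python
import numpy as np

def accuracy_and_error_ci(true_labels, positive_ratios, threshold=0.5, z=1.96):
    """Return (accuracy, error, (ci_low, ci_high)) for thresholded k-NN outputs.

    Assumptions: the positive class is encoded as 1, and the interval uses
    the normal approximation to the binomial distribution.
    """
    true_positive = np.asarray(true_labels) == 1
    # A point is predicted positive if more than `threshold` of its neighbors are positive.
    pred_positive = np.asarray(positive_ratios) > threshold
    n = true_positive.size
    accuracy = float(np.mean(true_positive == pred_positive))
    error = 1.0 - accuracy
    half_width = z * np.sqrt(error * (1.0 - error) / n)
    return accuracy, error, (error - half_width, error + half_width)
```

The approximation is reasonable when $n$ is large and the error is not too close to 0 or 1.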
%% Cell type:code id: tags:
``` python
final_labels = my_model.predict(my_model.test_indices)
# For now, we will consider a data point as predicted in the positive class if more than 0.5
# of its k-neighbors are positive.
threshold = 0.5
# Calculate accuracy and generalization error with confidence interval here.
```
%% Cell type:markdown id: tags:
### Plotting a learning curve
A learning curve shows how error changes as the training set size increases. For more information, see [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/).
We'll plot the error values for training and validation data while varying the size of the training set. Report a training set size that gives a good balance between bias and variance.
%% Cell type:code id: tags:
``` python
# try sizes 0, 100, 200, 300, ..., up to the largest multiple of 100 <= train_size
training_sizes = np.arange(0, my_model.train_size + 1, 100)
# Calculate error for each entry in training_sizes
# for training and validation sets and populate
# error_train and error_val arrays. Each entry in these arrays
# should correspond to each entry in training_sizes.
plt.plot(training_sizes, error_train, 'r', label = 'training_error')
plt.plot(training_sizes, error_val, 'g', label = 'validation_error')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Computing the confusion matrix for $k = 10$
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet.
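As an illustration of the idea, here is a minimal sketch. The layout (rows are actual classes, columns are predicted classes, positive class first) and the encoding of the positive class as `1` are assumptions made for this example; the function name is hypothetical.

```python
import numpy as np

def conf_matrix_sketch(true, pred, threshold=0.5):
    """Return a 2x2 array [[TP, FN], [FP, TN]].

    Assumptions: `true` uses 1 for the positive class, and `pred` holds the
    ratio of positive neighbors, thresholded here into class predictions.
    """
    true_pos = np.asarray(true) == 1
    pred_pos = np.asarray(pred) > threshold
    tp = int(np.sum(true_pos & pred_pos))
    fn = int(np.sum(true_pos & ~pred_pos))
    fp = int(np.sum(~true_pos & pred_pos))
    tn = int(np.sum(~true_pos & ~pred_pos))
    return np.array([[tp, fn], [fp, tn]])
```

Accuracy can then be read off the matrix as the sum of the diagonal divided by the sum of all entries.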
%% Cell type:code id: tags:
``` python
conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold = 0.5)
```
%% Cell type:markdown id: tags:
### Finding a good value for $k$
We can use the validation set to find a $k$ value that gives better accuracy. Additionally, in some cases, correctly predicting examples from a certain class is more critical than for other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples correctly in the critical class.
Below, calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
%% Cell type:code id: tags:
``` python
# Change values of k.
# Calculate accuracies for the validation set.
# Report a good k value that you'll use in the following analyses.
```
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now implement, in "model.ipynb", a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict" and plots the ROC curve. "ROC" takes a list of $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and one where each entry is the corresponding 1-specificity.
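The computation behind each point of the curve can be sketched as follows. This simplified version works directly on arrays of true labels and positive-neighbor ratios instead of a model and test indices, so its name and signature are hypothetical; it also assumes the positive class is labeled `1`.

```python
import numpy as np

def roc_points(true, positive_ratios, thresholds):
    """Return (sensitivities, one_minus_specificities), one entry per threshold.

    Assumption: `true` uses 1 for the positive class.
    """
    true_pos = np.asarray(true) == 1
    ratios = np.asarray(positive_ratios, dtype=float)
    n_pos = max(int(np.sum(true_pos)), 1)   # guard against division by zero
    n_neg = max(int(np.sum(~true_pos)), 1)
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred_pos = ratios > t
        tp = np.sum(true_pos & pred_pos)    # true positives at this threshold
        fp = np.sum(~true_pos & pred_pos)   # false positives at this threshold
        sens.append(tp / n_pos)
        one_minus_spec.append(fp / n_neg)
    return np.array(sens), np.array(one_minus_spec)
```

Because sensitivity is TP / (TP + FN) and 1-specificity is FP / (FP + TN), raising the threshold can only move both values toward 0.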
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. (Use the $k$ value you found above.)
%% Cell type:code id: tags:
``` python
# confusion matrix
conf_matrix(true_classes, predicted_classes)
```
%% Cell type:code id: tags:
``` python
# ROC curve
roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))
plt.plot(roc_spec_, roc_sens)  # x: 1-specificity, y: sensitivity
plt.xlabel('1 - specificity')
plt.ylabel('sensitivity')
plt.show()
```