Commit 88f10a75 authored by Zeynep Hakguder
clean

parent 658639d5
%% Cell type:markdown id: tags:
# k-Nearest Neighbor
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions available from model.ipynb.
%% Cell type:code id: tags:
``` python
%run 'model-Solution.ipynb'
```
%% Cell type:markdown id: tags:
Choice of distance metric plays an important role in the performance of kNN. Let's start by implementing a distance method in the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
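To make the effect of the metric concrete, here is a minimal sketch that compares Euclidean and Manhattan distance on two made-up 2D points. The points are illustrative only and are not taken from the dataset used in this assignment.

```python
import numpy as np

# two made-up points, chosen so the arithmetic is easy to check by hand
x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

# Euclidean: square root of the summed squared differences, sqrt(9 + 16)
euclidean = np.sqrt(np.sum(np.square(x - y)))
# Manhattan: sum of the absolute differences, 3 + 4
manhattan = np.sum(np.abs(x - y))

print(euclidean, manhattan)  # 5.0 7.0
```

The same pair of points is 5 units apart under one metric and 7 under the other, so the ranking of neighbors can change with the metric.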
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: Euclidean, Hamming, etc.
    '''
    if metric == 'Euclidean':
        dist = np.sqrt(np.sum(np.square(x - y)))
    else:
        # fail loudly instead of returning an undefined value
        raise ValueError('{} is not a valid metric.'.format(metric))
    return dist  # scalar distance between x and y
```
%% Cell type:markdown id: tags:
We can now implement our kNN classifier. The kNN class inherits from the Model class. Implement the "fit" and "predict" methods, using the "distance" function you defined above. "fit" takes $k$ as an argument. "predict" takes the indices of the test points and, for each one, outputs the proportion of its $k$ nearest neighbors that belong to the positive class; thresholding this proportion gives the predicted class.
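The core of the prediction step can be sketched on toy data as follows. This is a minimal, vectorized illustration under stated assumptions: the function name `knn_positive_fraction` and the toy arrays are hypothetical, Euclidean distance is hard-coded, and it does not use the Model class.

```python
import numpy as np

def knn_positive_fraction(train_X, train_y, query, k):
    """Fraction of the k nearest training points (Euclidean) labelled 1."""
    # distance from the query to every training point
    dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
    # argsort is ascending, so the first k indices are the nearest neighbors
    nearest = np.argsort(dists)[:k]
    return np.mean(train_y[nearest] == 1)

# made-up training set: three points near the origin, two far away
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([1, 1, 0, 0, 0])

frac = knn_positive_fraction(train_X, train_y, np.array([0.0, 0.5]), k=3)
print(frac)  # two of the three nearest neighbors are positive
```

Note that the slice is taken from the start of the ascending `argsort` result; reversing the order first would select the farthest points instead.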
%% Cell type:code id: tags:
``` python
class kNN(Model):
    def fit(self, k, distance_f, **kwargs):
        # store the hyperparameters; kNN has no other training step
        self.k = k
        self.distance_f = distance_f
        self.distance_metric = kwargs['metric']
        return

    # returns one score per test point; vary the threshold on it for ROC analysis
    def predict(self, test_points):
        # test_points holds indices into self.features
        chosen_labels = []
        for test_point in self.features[test_points]:
            distances = []
            labels = []
            # distance from the test point to every training point
            for index in self.training_indices:
                dist = self.distance_f(self.features[index], test_point, self.distance_metric)
                distances.append(dist)
                labels.append(self.labels[index])
            # argsort is ascending: nearest training points come first
            a_order = np.argsort(distances)
            # labels of the k nearest neighbors
            tmp_labels = list(np.array(labels)[a_order[:self.k]])
            # proportion of the k nearest neighbors with label 1
            b = tmp_labels.count(1)
            chosen_labels.append(b / self.k)
        # one proportion per test point; threshold it to obtain a class label
        return np.array(chosen_labels)
```
%% Cell type:markdown id: tags:
It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for the $file\_path$ parameter of the "preprocess" function.
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}
# initialize the model
my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)
```
%% Cell type:markdown id: tags:
Assign a value to $k$ and fit the kNN model. You do not need to change the value of the $threshold$ parameter yet.
%% Cell type:code id: tags:
``` python
kwargs_f = {'metric': 'Euclidean'}
my_model.fit(k = 10, distance_f=distance, **kwargs_f)
```
%% Cell type:markdown id: tags:
Evaluate your model on the test data and report your accuracy. Also, calculate and report the confidence interval on the generalization error estimate.
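One common way to obtain such an interval is the normal approximation to the binomial distribution, $err \pm z\sqrt{err(1-err)/n}$. The sketch below assumes that approximation is acceptable here; the function name `error_confidence_interval` and the counts are hypothetical and are not results from this notebook.

```python
import math

def error_confidence_interval(n_errors, n_test, z=1.96):
    """Normal-approximation interval for the true error rate (z=1.96 for ~95%)."""
    err = n_errors / n_test
    half_width = z * math.sqrt(err * (1 - err) / n_test)
    return err - half_width, err + half_width

# hypothetical counts: 30 misclassified examples out of 100 test examples
low, high = error_confidence_interval(n_errors=30, n_test=100)
print(round(low, 4), round(high, 4))  # 0.2102 0.3898
```

The approximation is usually considered reasonable when the test set is large and the error rate is not too close to 0 or 1.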
%% Cell type:code id: tags:
``` python
final_labels = my_model.predict(my_model.test_indices)
```
%% Cell type:markdown id: tags:
Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the "conf_matrix" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray.
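Once the four counts are available, accuracy, sensitivity and specificity follow directly. The sketch below assumes the counts are ordered as [tp, tn, fp, fn]; the helper name `summarize_counts` is hypothetical, and the input is the array reported in this notebook for seed 123.

```python
import numpy as np

def summarize_counts(counts):
    """counts is assumed to be ordered as [tp, tn, fp, fn]."""
    tp, tn, fp, fn = counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = summarize_counts(np.array([196, 106, 193, 105]))
print(acc, sens, spec)
```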
%% Cell type:code id: tags:
``` python
# You should see array([ 196, 106, 193, 105]) with seed 123
conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold= 0.5)
```
%% Output
array([196, 106, 193, 105])
%% Cell type:markdown id: tags:
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. Now, implement a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "fit" and plot the ROC curve. "ROC" takes a list of $threshold$ values to try and returns a (sensitivity, 1-specificity) pair for each $threshold$ value.
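To see how the threshold moves a point along the curve, here is a small self-contained sketch on made-up scores. The function name `roc_points` and the toy arrays are hypothetical; it computes the counts with boolean masks rather than with "conf_matrix".

```python
import numpy as np

def roc_points(true, scores, thresholds):
    """Return (sensitivities, 1 - specificities), one entry per threshold."""
    true = np.asarray(true)
    scores = np.asarray(scores)
    sens, spec_ = [], []
    for t in thresholds:
        pred = scores >= t                    # positive when the score reaches t
        tp = np.sum(pred & (true == 1))
        fn = np.sum(~pred & (true == 1))
        fp = np.sum(pred & (true != 1))
        tn = np.sum(~pred & (true != 1))
        sens.append(tp / (tp + fn))
        spec_.append(fp / (fp + tn))          # equals 1 - specificity
    return np.array(sens), np.array(spec_)

# made-up labels and scores (e.g. proportion of positive neighbors)
true = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.1])
sens, spec_ = roc_points(true, scores, [0.0, 0.5, 1.0])
print(sens, spec_)
```

A threshold of 0 labels everything positive, giving the point (1, 1), and a threshold above every score labels everything negative, giving (0, 0); intermediate thresholds trace the curve between them.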
%% Cell type:code id: tags:
``` python
def ROC(true, pred, value_list):
    '''
    true: nx1 array of true labels for test set
    pred: nx1 array of predicted labels for test set
    Calculate sensitivity and 1-specificity for each point in value_list
    Return two nx1 arrays: sens (for sensitivities) and spec_ (for 1-specificities)
    '''
    return sens, spec_
```
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our kNN classifier.
%% Cell type:code id: tags:
``` python
# confusion matrix
conf_matrix(true_classes, predicted_classes)
```
%% Cell type:code id: tags:
``` python
# ROC curve
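roc_sens, roc_spec_ = ROC(true_classes, predicted_classes, np.arange(0.1, 1.0, 0.1))
# ROC convention: 1-specificity on the x-axis, sensitivity on the y-axis
plt.plot(roc_spec_, roc_sens)
plt.xlabel('1 - specificity')
plt.ylabel('sensitivity')
plt.show()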
```
%% Cell type:markdown id: tags:
# JUPYTER NOTEBOOK TIPS
Each rectangular box is called a cell.
* Ctrl+ENTER evaluates the current cell: if it contains Python code, it runs the code; if it contains Markdown, it renders the text.
* Alt+ENTER evaluates the current cell and adds a new cell below it.
* If you click to the left of a cell, you'll notice the frame changes color to blue. You can erase a cell by hitting 'dd' (that's two "d"s in a row) when the frame is blue.
%% Cell type:markdown id: tags:
# Supervised Learning Model Skeleton
We'll use this skeleton for implementing different supervised learning algorithms.
%% Cell type:code id: tags:
``` python
class Model:

    def fit(self):
        raise NotImplementedError

    def predict(self, test_points):
        raise NotImplementedError
```
%% Cell type:code id: tags:
``` python
def preprocess(feature_file, label_file):
    '''
    Args:
        feature_file: str
            file containing features
        label_file: str
            file containing labels
    Returns:
        features: ndarray
            nxd features
        labels: ndarray
            nx1 labels
    '''
    # read in features and labels
    features = np.genfromtxt(feature_file)
    labels = np.genfromtxt(label_file)
    return features, labels
```
%% Cell type:code id: tags:
``` python
def partition(size, t, v = 0):
    '''
    Args:
        size: int
            number of examples in the whole dataset
        t: float
            proportion kept for test
        v: float
            proportion kept for validation
    Returns:
        test_indices: ndarray
            1D array containing test set indices
        val_indices: ndarray
            1D array containing validation set indices
        train_indices: ndarray
            1D array containing the remaining (training) indices
    '''
    # number of test and validation examples
    t_size = int(np.ceil(size*t))
    v_size = int(np.ceil(size*v))
    # shuffle the indices
    permuted = np.random.permutation(size)
    # spare the first t_size for test
    test_indices = permuted[:t_size]
    # and the next v_size for validation
    val_indices = permuted[t_size:t_size+v_size]
    # every index not used for test or validation goes to training
    train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)
    return test_indices, val_indices, train_indices
```
%% Cell type:markdown id: tags:
## TASK 1: Implement `distance` function
%% Cell type:markdown id: tags:
The "distance" function will be used by *k*-NN to measure how far apart two data points are. It should take two data points and the name of the metric and return a scalar value.
%% Cell type:code id: tags:
``` python
#TODO: Programming Assignment 1
def distance(x, y, metric):
    '''
    Args:
        x: ndarray
            1D array containing coordinates for a point
        y: ndarray
            1D array containing coordinates for a point
        metric: str
            Euclidean, Manhattan
    Returns:
        dist: float
    '''
    if metric == 'Euclidean':
        dist = np.sqrt(np.sum(np.square(x - y)))
    elif metric == 'Manhattan':
        dist = np.sum(np.abs(x - y))
    else:
        raise ValueError('{} is not a valid metric.'.format(metric))
    return dist  # scalar distance between x and y
```
%% Cell type:markdown id: tags:
## General supervised learning performance related functions
%% Cell type:markdown id: tags:
Implement the "conf_matrix" function that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray.
%% Cell type:code id: tags:
``` python
# TODO: Programming Assignment 1
def conf_matrix(true, pred):
    '''
    Args:
        true: ndarray
            nx1 array of true labels for test set
        pred: ndarray
            nx1 array of predicted labels for test set
    Returns:
        ndarray
    '''
    tp = tn = fp = fn = 0
    # calculate true positives (tp), true negatives (tn),
    # false positives (fp) and false negatives (fn)
    size = len(true)
    for i in range(size):
        if true[i] == 1:
            if pred[i] > 0:
                tp += 1
            else:
                fn += 1
        else:
            if pred[i] == 0:
                tn += 1
            else:
                fp += 1
    # returns the confusion matrix as numpy.ndarray
    return np.array([tp, tn, fp, fn])
```
%% Cell type:markdown id: tags:
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cutoff points. "ROC" takes a list of *threshold* values to try and returns two arrays: one whose entries are the sensitivities at each threshold and one whose entries are the corresponding 1-specificities.
%% Cell type:code id: tags:
``` python
# TODO: Programming Assignment 1
def ROC(true_labels, preds, value_list):
    '''
    Args:
        true_labels: ndarray
            1D array containing true labels
        preds: ndarray
            1D array containing the values to threshold (e.g. proportion of positive neighbors in kNN)
        value_list: ndarray
            1D array containing different threshold values
    Returns:
        sens: ndarray
            1D array containing sensitivities
        spec_: ndarray
            1D array containing 1-specificities
    '''
    sens = []
    spec_ = []
    for threshold in value_list:
        # label a point positive when its value reaches the threshold
        pred_labels = [1 if x >= threshold else 0 for x in preds]
        # use conf_matrix to calculate tp, tn, fp, fn
        tp, tn, fp, fn = conf_matrix(true_labels, pred_labels)
        # calculate sensitivity and 1-specificity
        se = tp/(tp+fn)
        sens.append(se)
        spec = tn/(tn+fp)
        spec_.append(1 - spec)
    # return two arrays
    return np.array(sens), np.array(spec_)
```