Commit b9420489 authored by Zeynep Hakguder

PA1 missing learning curve, PA2 missing naive Bayes
%% Cell type:markdown id: tags:
# $k$-Nearest Neighbor
We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. A skeleton of a general supervised learning model is provided in "model.ipynb". Please look through it and complete the "preprocess" and "partition" methods.
### Assignment Goals:
In this assignment, we will:
* learn to split a dataset into training/validation/test partitions
* use the validation dataset to find a good value for $k$
* having found the "best" $k$, obtain final performance measures:
  * accuracy, generalization error, and ROC curve
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions available from "model.ipynb".
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:markdown id: tags:
The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing the "distance" function below. It should take two data points and the name of the metric and return a scalar value.
%% Cell type:code id: tags:
``` python
def distance(x, y, metric):
    '''
    x: a 1xd array
    y: a 1xd array
    metric: 'Euclidean', 'Hamming', etc.
    Returns the scalar distance between x and y.
    '''
    raise NotImplementedError
```
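%% Cell type:markdown id: tags:
For concreteness, here is a minimal sketch of what a completed "distance" could look like, assuming $x$ and $y$ are 1-d numpy arrays and $metric$ is a string. The name "distance_sketch" is only to keep it separate from your own implementation.
%% Cell type:code id: tags:
``` python
def distance_sketch(x, y, metric):
    '''A sketch of distance, not the assignment's reference solution.'''
    if metric.lower() == 'euclidean':
        return np.sqrt(np.sum((x - y) ** 2))
    elif metric.lower() == 'hamming':
        # number of coordinates where x and y differ
        return np.sum(x != y)
    else:
        raise ValueError('unknown metric: ' + metric)
```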
%% Cell type:markdown id: tags:
### $k$-NN Class Methods
%% Cell type:markdown id: tags:
We can now start implementing our $k$-NN classifier. The $k$-NN class inherits the Model class. You'll need to implement the "fit" and "predict" methods, using the "distance" function you defined above. The "fit" method takes $k$ as an argument. "predict" takes the feature vector of a single test point (and, optionally, a decision threshold) as input and outputs the predicted class along with the proportion of the $k$ nearest neighbors that share that class.
%% Cell type:code id: tags:
``` python
class kNN(Model):
    '''
    Inherits the Model class. Implements the k-NN algorithm for classification.
    '''
    def __init__(self, preprocessor_f, partition_f, distance_f):
        super().__init__(preprocessor_f, partition_f)
        # set self.distance_f and self.distance_metric

    def fit(self, k):
        '''
        Fit the model. This is pretty straightforward for k-NN: store k.
        '''
        raise NotImplementedError

    def predict(self, test_point, threshold=0.5):
        # use self.distance_f(..., self.distance_metric)
        # return the predicted class label and the following ratio:
        #   number of nearest neighbors that share the predicted label / k
        raise NotImplementedError
```
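%% Cell type:markdown id: tags:
As a sketch of how "predict" might use "distance": compute the distance from the test point to every training example, take the $k$ closest, and derive the label and ratio from their labels. The attributes self.k (assumed stored by "fit"), self.training_indices, and the binary $0/1$ labels below are assumptions, not requirements of "model.ipynb".
%% Cell type:code id: tags:
``` python
def predict_sketch(self, test_point, threshold=0.5):
    '''A sketch of predict under a binary 0/1 label assumption.'''
    train_idx = np.asarray(self.training_indices)  # assumed attribute
    dists = np.array([self.distance_f(test_point, self.features[i], self.distance_metric)
                      for i in train_idx])
    nearest = train_idx[np.argsort(dists)[:self.k]]  # indices of the k closest points
    ratio = np.mean(self.labels[nearest] == 1)       # fraction of positive neighbors
    predicted_label = 1 if ratio >= threshold else 0
    return predicted_label, ratio
```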
%% Cell type:markdown id: tags:
### Build and Evaluate the Model (Accuracy, Confidence Interval, Confusion Matrix)
%% Cell type:markdown id: tags:
It's time to build and evaluate our model. Remember that you need to provide values for the $p$ and $v$ parameters of the "partition" function and for $file\_path$ in the "preprocess" function.
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
kwargs = {'p': 0.3, 'v': 0.1, 'file_path': 'mnist_test.csv', 'metric': 'Euclidean'}
# initialize the model
my_model = kNN(preprocessor_f=preprocess, partition_f=partition, distance_f=distance, **kwargs)
```
%% Cell type:markdown id: tags:
Assign a value to $k$ and fit the $k$-NN model.
%% Cell type:code id: tags:
``` python
my_model.fit(k=10)
```
%% Cell type:markdown id: tags:
You can use "predict_batch" function below to evaluate your model on the test data. You do not need to change the value of the threshold yet.
%% Cell type:code id: tags:
``` python
def predict_batch(model, indices, threshold=0.5):
    '''
    model: a fitted k-NN model
    indices: indices of the data points to predict
    threshold: lower limit on the ratio for a point to be predicted positive
    '''
    predicted_labels = []
    true_labels = []
    for index in indices:
        # vary the threshold value for ROC analysis
        label, ratio = model.predict(model.features[index], threshold)
        predicted_labels.append(label)
        true_labels.append(model.labels[index])
    return predicted_labels, true_labels
```
%% Cell type:markdown id: tags:
Use "predict_batch" function above to report your model's accuracy on the test set. Also, calculate and report the confidence interval on the generalization error estimate.
%% Cell type:code id: tags:
``` python
predicted_labels, true_labels = predict_batch(my_model, my_model.test_indices)
# Calculate accuracy and the generalization error with its confidence interval here.
```
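%% Cell type:markdown id: tags:
One standard way to get the confidence interval is the normal approximation to the binomial: with estimated error $error$ on $n$ test examples, a $95\%$ interval is $error \pm 1.96\sqrt{error(1-error)/n}$. A sketch:
%% Cell type:code id: tags:
``` python
predicted_labels = np.asarray(predicted_labels)
true_labels = np.asarray(true_labels)
accuracy = np.mean(predicted_labels == true_labels)
error = 1 - accuracy                      # generalization error estimate
n = len(true_labels)
# 95% confidence interval via the normal approximation to the binomial
half_width = 1.96 * np.sqrt(error * (1 - error) / n)
print('accuracy: {:.3f}'.format(accuracy))
print('error: {:.3f} +/- {:.3f}'.format(error, half_width))
```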
%% Cell type:markdown id: tags:
Now that we have the true labels and our model's predictions, we can build a confusion matrix and see where the model goes wrong. Implement the "conf_matrix" function that takes an array of true labels ($true$) and an array of predicted labels ($pred$) as input. It should output the confusion matrix as a numpy.ndarray.
%% Cell type:code id: tags:
``` python
def conf_matrix(true, pred):
    '''
    true: nx1 array of true labels for the test set
    pred: nx1 array of predicted labels for the test set
    Returns the confusion matrix as a numpy.ndarray.
    '''
    raise NotImplementedError
```
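%% Cell type:markdown id: tags:
A possible shape for "conf_matrix", assuming integer class labels $0, \ldots, c-1$; entry $[i, j]$ counts examples with true class $i$ and predicted class $j$:
%% Cell type:code id: tags:
``` python
def conf_matrix_sketch(true, pred):
    '''A sketch of conf_matrix for integer labels 0..c-1.'''
    true = np.asarray(true, dtype=int).ravel()
    pred = np.asarray(pred, dtype=int).ravel()
    c = int(max(true.max(), pred.max())) + 1
    c_mat = np.zeros((c, c), dtype=int)
    for t, p in zip(true, pred):
        c_mat[t, p] += 1   # row: true class, column: predicted class
    return c_mat
```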
%% Cell type:markdown id: tags:
### Finding a good value for $k$
We can use the validation set to choose a $k$ value that gives better accuracy. Additionally, in some applications, correctly predicting examples from a certain class is more critical than for the other classes. In those cases, we can use the confusion matrix to find a good trade-off between correct and wrong predictions, allowing more wrong predictions in some classes in order to predict more examples correctly in the critical class.
Below, calculate the accuracies and confusion matrices for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section.
%% Cell type:code id: tags:
``` python
# Try different values of k.
# Calculate accuracies and confusion matrices for the validation set.
# Report a good k value to use in the following analyses.
```
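%% Cell type:markdown id: tags:
A sketch of such a sweep, assuming "model.ipynb" exposes the validation partition as my_model.val_indices (it is used the same way later in this notebook); the candidate $k$ values are placeholders:
%% Cell type:code id: tags:
``` python
val_accuracies = {}
for k in [1, 3, 5, 10, 20]:          # candidate values; adjust as needed
    my_model.fit(k=k)
    pred, true = predict_batch(my_model, my_model.val_indices)
    val_accuracies[k] = np.mean(np.asarray(pred) == np.asarray(true))
    print('k =', k, 'accuracy =', val_accuracies[k])
    print(conf_matrix(true, pred))
best_k = max(val_accuracies, key=val_accuracies.get)
```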
%% Cell type:markdown id: tags:
### ROC curve and confusion matrix for the final model
ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now implement a "ROC" function that predicts the labels of the test set examples using different $threshold$ values in "predict", and plot the ROC curve. "ROC" takes a list of $threshold$ values to try and returns two arrays: one where each entry is the sensitivity at a given threshold, and another where each entry is the corresponding 1-specificity.
%% Cell type:code id: tags:
``` python
def ROC(model, indices, value_list):
    '''
    model: a fitted k-NN model
    indices: indices of the data points to predict
    value_list: array containing different threshold values
    Calculates sensitivity and 1-specificity for each value in value_list.
    Returns two nx1 arrays: sens (sensitivities) and spec_ (1-specificities).
    '''
    # use predict_batch to obtain predicted labels at different threshold values
    raise NotImplementedError
```
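%% Cell type:markdown id: tags:
A sketch of "ROC" under the same binary $0/1$ label assumption used above (positive class labeled $1$):
%% Cell type:code id: tags:
``` python
def ROC_sketch(model, indices, value_list):
    '''A sketch of ROC; the positive class is assumed to be labeled 1.'''
    sens, spec_ = [], []
    for threshold in value_list:
        pred, true = predict_batch(model, indices, threshold)
        pred, true = np.asarray(pred), np.asarray(true)
        tp = np.sum((pred == 1) & (true == 1))
        fn = np.sum((pred == 0) & (true == 1))
        fp = np.sum((pred == 1) & (true == 0))
        tn = np.sum((pred == 0) & (true == 0))
        sens.append(tp / (tp + fn))    # sensitivity = true positive rate
        spec_.append(fp / (fp + tn))   # 1 - specificity = false positive rate
    return np.array(sens), np.array(spec_)
```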
%% Cell type:markdown id: tags:
We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier.
%% Cell type:code id: tags:
``` python
# confusion matrix
predicted_labels, true_labels = predict_batch(my_model, my_model.test_indices)
conf_matrix(true_labels, predicted_labels)
```
%% Cell type:code id: tags:
``` python
# ROC curve
roc_sens, roc_spec_ = ROC(my_model, my_model.test_indices, np.arange(0.1, 1.0, 0.1))
plt.plot(roc_spec_, roc_sens)  # x: 1-specificity, y: sensitivity
plt.xlabel('1 - specificity')
plt.ylabel('sensitivity')
plt.show()
```
%% Cell type:markdown id: tags:
# Linear Regression & Naive Bayes
We'll implement the linear regression and naive Bayes algorithms for this assignment. Please modify the "preprocess" and "partition" methods in "model.ipynb" to suit your datasets for this assignment. This time we have a small dataset available to us, so we won't have examples to spare for a validation set; instead, we'll use cross-validation to tune hyperparameters.
### Assignment Goals:
In this assignment, we will:
* implement linear regression
* use gradient descent for optimization
* use residuals to decide if we need a polynomial model
* change our model to quadratic/cubic regression and use cross-validation to find the "best" polynomial degree
* implement regularization techniques
* $l_1$/$l_2$ regularization
* use cross-validation to find a good regularization parameter $\lambda$
* implement Naive Bayes
  * address the sparse data problem with **pseudocounts** (the **$m$-estimate**; see the note after this list)
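
As a reminder, a common form of the $m$-estimate replaces the maximum-likelihood estimate $n_c/n$ of a conditional probability with

$$P(x_i \mid c) = \frac{n_c + mp}{n + m}$$

where $n$ is the number of training examples of class $c$, $n_c$ is the number of those that have feature value $x_i$, $p$ is a prior estimate of the probability (e.g. uniform over the possible feature values), and $m$ is the equivalent sample size of the prior. Choosing $p$ uniform and $m$ equal to the number of distinct feature values gives Laplace (add-one) smoothing.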
%% Cell type:markdown id: tags:
You can use numpy for array operations and matplotlib for plotting in this assignment. Please do not add other libraries.
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The following code makes the Model class and relevant functions available from "model.ipynb".
%% Cell type:code id: tags:
``` python
%run 'model.ipynb'
```
%% Cell type:code id: tags:
``` python
def mse(y_pred, y_true):
    '''
    y_pred: values predicted by our method
    y_true: true y values
    Returns the mean squared error between y_pred and y_true.
    '''
    raise NotImplementedError
```
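%% Cell type:markdown id: tags:
A minimal sketch of "mse", assuming y_pred and y_true are arrays of the same shape:
%% Cell type:code id: tags:
``` python
def mse_sketch(y_pred, y_true):
    '''Mean squared error: the average of (y_pred - y_true)^2.'''
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2)
```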
%% Cell type:markdown id: tags:
We'll start by implementing a partition function for $k$-fold cross-validation. $5$ and $10$ are commonly used values for $k$. You can use either one of them.
%% Cell type:code id: tags:
``` python
def kfold(k):
    '''
    k: number of desired splits of the data.
    Assumes the test set is already separated.
    Chooses 1/k of the training indices uniformly at random and separates
    them out as the validation partition.
    Returns the selected 1/k of the indices as val_indices and the
    remaining (k-1)/k as train_indices.
    '''
    raise NotImplementedError
```
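%% Cell type:markdown id: tags:
A sketch of the selection step. The explicit training_indices argument is an assumption; how the training indices are actually exposed depends on your "model.ipynb".
%% Cell type:code id: tags:
``` python
def kfold_sketch(k, training_indices):
    '''A sketch of kfold with the training indices passed in explicitly.'''
    shuffled = np.random.permutation(training_indices)
    n_val = len(shuffled) // k           # size of the validation partition
    val_indices = shuffled[:n_val]
    train_indices = shuffled[n_val:]
    return train_indices, val_indices
```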
%% Cell type:code id: tags:
``` python
class linear_regression(Model):
    def __init__(self, preprocessor_f, partition_f, **kwargs):
        super().__init__(preprocessor_f, partition_f, **kwargs)
        # initialize the model parameters
        self.theta = None

    # You can disregard polynomial_degree and regularizer in your first pass.
    def fit(self, learning_rate=0.001, epochs=1000, regularizer=None, polynomial_degree=1, **kwargs):
        for epoch in range(epochs):
            # compute y_hat, the model predictions for the training examples
            y_hat = None
            # use the mse function to find the cost
            cost = None
            # calculate the gradient of the cost wrt theta
            grad_theta = None
            # update theta
            theta_curr = None
            raise NotImplementedError
        return self.theta

    def predict(self, indices):
        # returns y_hat, the predictions for the given indices
        raise NotImplementedError
```
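%% Cell type:markdown id: tags:
For the plain (unregularized) linear model $\hat{y} = X\theta$, the gradient of the MSE cost with respect to $\theta$ is $\frac{2}{n} X^T (X\theta - y)$. A sketch of one gradient-descent epoch, assuming $X$ is an $n \times d$ design matrix (with a bias column) and $y$ an $n$-vector:
%% Cell type:code id: tags:
``` python
def gd_epoch_sketch(X, y, theta, learning_rate):
    '''One gradient-descent step on the MSE cost; a sketch, not the full fit.'''
    y_hat = X @ theta                                 # model predictions
    grad_theta = 2.0 / len(y) * X.T @ (y_hat - y)     # gradient of MSE wrt theta
    return theta - learning_rate * grad_theta         # updated parameters
```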
%% Cell type:code id: tags:
``` python
# populate the keyword arguments dictionary kwargs
kwargs = {'p': 0.3, 'v': 0.0, 'file_path': 'mnist_test.csv', 'k': 5}
# initialize the model
my_model = linear_regression(preprocessor_f=preprocess, partition_f=partition, k_fold=True, **kwargs)
```
%% Cell type:code id: tags:
``` python
# use fit_kwargs to pass arguments to regularization function
fit_kwargs = {}
my_model.fit(**fit_kwargs)
```
%% Cell type:markdown id: tags:
Residuals are the differences between the true value $y$ and the predicted value $\hat{y}$ for each example. Predict $\hat{y}$ for the validation set, then calculate and plot the residuals.
%% Cell type:code id: tags:
``` python
y_hat_val = my_model.predict(my_model.val_indices)
residuals = my_model.labels[my_model.val_indices] - y_hat_val
plt.plot(residuals)
plt.show()
```
%% Cell type:markdown id: tags:
If the data is better suited to quadratic/cubic regression, regions of positive and negative residuals will alternate in the plot. Regardless, modify "fit" and "predict" in the class definition to raise the feature values to $polynomial\_degree$. Make the modification directly in the definition above; do not repeat it here. Then use the validation set to find the polynomial degree that results in the lowest "mse".
%% Cell type:code id: tags:
``` python
# calculate mse for the linear model
fit_kwargs = {}
my_model.fit(polynomial_degree=1, **fit_kwargs)
pred_1 = my_model.predict(my_model.val_indices)
mse_1 = mse(pred_1, my_model.labels[my_model.val_indices])
# calculate mse for the quadratic model
my_model.fit(polynomial_degree=2, **fit_kwargs)
pred_2 = my_model.predict(my_model.val_indices)
mse_2 = mse(pred_2, my_model.labels[my_model.val_indices])
# calculate mse for the cubic model
my_model.fit(polynomial_degree=3, **fit_kwargs)
pred_3 = my_model.predict(my_model.val_indices)
mse_3 = mse(pred_3, my_model.labels[my_model.val_indices])
```
%% Cell type:markdown id: tags:
Define "regularization" function which implements $l_1$ and $l_2$ regularization. You'll use this function in "fit" method of "linear_regression" class.
%% Cell type:code id: tags:
``` python
def regularization(method):
    '''
    method: 'l1' or 'l2'
    '''
    if method == 'l1':
        raise NotImplementedError
    elif method == 'l2':
        raise NotImplementedError
```
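%% Cell type:markdown id: tags:
As a sketch of what the two branches could compute when the penalty contributes to the gradient update. Passing theta and the regularization parameter lam explicitly is an assumption, since the skeleton above only receives the method name:
%% Cell type:code id: tags:
``` python
def regularization_sketch(method, theta, lam):
    '''Gradient contribution of the penalty term; a sketch only.'''
    if method == 'l1':
        return lam * np.sign(theta)    # subgradient of lam * ||theta||_1
    elif method == 'l2':
        return 2 * lam * theta         # gradient of lam * ||theta||_2^2
```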
%% Cell type:markdown id: tags:
Using the validation set and the value of $polynomial\_degree$ you found above, try different values of $\lambda$ to find a good value that results in low "mse". You can use cross-validation with the "kfold" function you implemented above.