diff --git a/ProgrammingAssignment1.ipynb b/ProgrammingAssignment1.ipynb
index 0951185f36643a0d18b3b8e31866d574cd73b20e..3dc81dcb07c601505ac7eb145934dbc6cb254760 100644
--- a/ProgrammingAssignment1.ipynb
+++ b/ProgrammingAssignment1.ipynb
@@ -6,7 +6,12 @@
    "source": [
     "# $k$-Nearest Neighbor\n",
     "\n",
-    "We'll implement $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods.\n",
+    "We'll implement the $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n",
+    "* the dependent variable should be binary (suited for binary classification)\n",
+    "* the number of features (attributes) should be at least 50\n",
+    "* the number of examples (instances) should be at least 1,000\n",
+    "\n",
+    "A skeleton of a general supervised learning model is provided in \"model.ipynb\". Please look through it and complete the \"preprocess\" and \"partition\" methods.\n",
     "\n",
     "### Assignment Goals:\n",
     "In this assignment, we will:\n",
@@ -53,7 +58,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value."
+    "The choice of distance metric plays an important role in the performance of $k$-NN. Let's start by implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric, and return a scalar value."
   ]
  },
  {
@@ -84,7 +89,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can start implementing our $k$-NN classifier. $k$-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes $k$ as an argument. \"predict\" takes as input the feature vector for a single test point and outputs the predicted class and the proportion of predicted class labels in $k$ nearest neighbors."
+    "We can start implementing our $k$-NN classifier. The kNN class inherits the Model class. You'll need to implement the \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. The \"fit\" method takes $k$ as an argument. \"predict\" takes as input an $m \\times d$ array containing the $d$-dimensional feature vectors of $m$ examples, and outputs the proportion of positive labels among the $k$ nearest neighbors of each example."
   ]
  },
  {
@@ -97,32 +102,32 @@
     "    '''\n",
     "    Inherits Model class. Implements the k-NN algorithm for classification.\n",
     "    '''\n",
-    "    def __init__(self, preprocessor_f, partition_f, distance_f):\n",
-    "        super().__init__(preprocessor_f, partition_f)\n",
-    "        \n",
-    "        # set self.distance_f and self.distance_metric\n",
-    "        \n",
-    "        \n",
-    "    def fit(self, k):\n",
+    "    \n",
+    "    def fit(self, k, distance_f, **kwargs):\n",
     "        '''\n",
     "        Fit the model. This is pretty straightforward for k-NN.\n",
     "        '''\n",
-    "        \n",
+    "        # set self.k, self.distance_f, self.distance_metric\n",
     "        raise NotImplementedError\n",
     "        \n",
     "        return\n",
     "    \n",
     "    \n",
-    "    def predict(self, test_point):\n",
+    "    def predict(self, test_indices):\n",
     "        \n",
     "        raise NotImplementedError\n",
     "        \n",
+    "        pred = []\n",
+    "        # For each test point, use your implementation\n",
+    "        # of the distance function,\n",
+    "        #   distance_f(..., distance_metric),\n",
+    "        # to find the labels of its k nearest neighbors.\n",
     "        \n",
-    "        # use self.distance_f(...,self.distance_metric)\n",
+    "        # Find the ratio of positive labels among them\n",
+    "        # and append it to pred with pred.append(ratio).\n",
     "        \n",
-    "        # return the predicted class label and the following ratio: \n",
-    "        # number of points that have the same label as the test point / k\n",
-    "        return predicted_label, ratio\n",
+    "\n",
+    "        return np.array(pred)\n",
     "    "
   ]
  },
@@ -147,32 +152,16 @@
   "outputs": [],
   "source": [
     "# populate the keyword arguments dictionary kwargs\n",
-    "kwargs = {'p': 0.3, 'v': 0.1, 'file_path': 'mnist_test.csv', 'metric': 'Euclidean'}\n",
+    "kwargs = {'p': 0.3, 'v': 0.1, 'seed': 123, 'file_path': 'madelon_train'}\n",
     "# initialize the model\n",
-    "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, distance=distance, **kwargs)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Assign a value to $k$ and fit the $k$-NN model."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "my_model.fit(k=10)"
+    "my_model = kNN(preprocessor_f=preprocess, partition_f=partition, **kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "You can use \"predict_batch\" function below to evaluate your model on the test data. You do not need to change the value of the threshold yet."
+    "Assign a value to $k$ and fit the $k$-NN model."
   ]
  },
  {
@@ -181,29 +170,15 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "def predict_batch(model, indices, threshold=0.5):\n",
-    "    '''\n",
-    "    model: a fitted k-NN model\n",
-    "    indices: for data points to predict\n",
-    "    threshold: lower limit on the ratio for a point to be considered positive\n",
-    "    '''\n",
-    "    \n",
-    "    predicted_labels = []\n",
-    "    true_labels = []\n",
-    "\n",
-    "    for index in indices:\n",
-    "        # vary the threshold value for ROC analysis\n",
-    "        predicted_classes.append(model.predict(model.features[index], threshold))\n",
-    "        true_classes.append(model.labels[index])\n",
-    "\n",
-    "    return predicted_labels, true_labels"
+    "kwargs_f = {'metric': 'Euclidean'}\n",
+    "my_model.fit(k=10, distance_f=distance, **kwargs_f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Use \"predict_batch\" function above to report your model's accuracy on the test set. Also, calculate and report the confidence interval on the generalization error estimate."
+    "Evaluate your model on the test data and report your accuracy. Also, calculate and report the confidence interval on the generalization error estimate."
   ]
  },
  {
@@ -212,7 +187,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "predict_batch(my_model, my_model.test_indices)\n",
+    "final_labels = my_model.predict(my_model.test_indices)\n",
     "# Calculate accuracy and generalization error with confidence interval here."
   ]
  },
@@ -220,9 +195,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# TODO: leaa \n",
+    "# TODO: learning curve\n",
     "\n",
-    "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. "
+    "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet."
   ]
  },
  {
@@ -231,14 +206,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "def conf_matrix(true, pred):\n",
-    "    '''\n",
-    "    true: nx1 array of true labels for test set\n",
-    "    pred: nx1 array of predicted labels for test set\n",
-    "    '''\n",
-    "    raise NotImplementedError\n",
-    "    # returns the confusion matrix as numpy.ndarray\n",
-    "    return c_mat"
+    "# With seed 123 you should see array([196, 106, 193, 105]), i.e. [tp, tn, fp, fn]\n",
+    "conf_matrix(my_model.labels[my_model.test_indices], final_labels, threshold=0.5)"
   ]
  },
  {
@@ -259,7 +228,7 @@
   "outputs": [],
   "source": [
     "# Change values of $k. \n",
-    "# Calculate accuracies and confusion matrices for the validation set.\n",
+    "# Calculate accuracies for the validation set.\n",
     "# Report a good k value that you'll use in the following analyses."
   ]
  },
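For reference alongside the ProgrammingAssignment1 changes above, here is a minimal sketch of the pieces the notebook asks for: a "distance" function, the positive-label ratio that "predict" computes for one test point, and the usual normal-approximation confidence interval on the error estimate. This is not part of the commit; it assumes NumPy arrays, the metric names used in kwargs, positive examples labeled 1, and a hypothetical train_indices array standing in for however the training examples are indexed.

import numpy as np

def distance(x, y, metric):
    # scalar distance between two feature vectors x and y
    if metric == 'Euclidean':
        return np.sqrt(np.sum((x - y) ** 2))
    elif metric == 'Manhattan':
        return np.sum(np.abs(x - y))
    else:
        raise ValueError('unknown metric: {}'.format(metric))

def knn_ratio(features, labels, train_indices, test_index, k, metric):
    # ratio of positive labels among the k nearest training neighbors
    dists = [distance(features[i], features[test_index], metric)
             for i in train_indices]
    nearest = np.asarray(train_indices)[np.argsort(dists)[:k]]
    return np.mean(labels[nearest] == 1)

def error_ci(err, n, z=1.96):
    # normal-approximation confidence interval on the generalization
    # error estimate from n test examples (z = 1.96 gives a 95% interval)
    half = z * np.sqrt(err * (1 - err) / n)
    return err - half, err + half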
diff --git a/ProgrammingAssignment2.ipynb b/ProgrammingAssignment2.ipynb
index 428855aa73245f97c4298e2c421df8e0737a1f45..41badf7aafb60b063f53955f8e76c2e87c35f603 100644
--- a/ProgrammingAssignment2.ipynb
+++ b/ProgrammingAssignment2.ipynb
@@ -31,7 +31,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -48,7 +48,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -57,7 +57,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -176,7 +176,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "If the data is better suited for quadratic/cubic regression, regions of positive and negative residuals will alternate in the plot. Regardless, modify the fit and predict in the class definition to raise the feature values to $polynomial\\_degree$. You can directly make the modification in the above definition, do not repeat. Use the validation set to find among the degree of polynomial that results in lowest \"mse\"."
+    "If the data is better suited for quadratic/cubic regression, regions of positive and negative residuals will alternate in the plot. Regardless, modify \"fit\" and \"predict\" in the class definition to raise the feature values to $polynomial\\_degree$. You can make the modification directly in the definition above; do not repeat it. Use the validation set to find the degree of polynomial that results in the lowest \"mse\"."
   ]
  },
  {
@@ -211,7 +211,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
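The degree-selection step described in the ProgrammingAssignment2 hunk above could look like the following sketch (not from the commit; x_train, y_train, x_val, y_val are hypothetical names, and np.vander builds the polynomial design matrix):

import numpy as np

def poly_design(x, degree):
    # (m,) feature vector -> (m, degree + 1) matrix [1, x, x^2, ..., x^degree]
    return np.vander(x, N=degree + 1, increasing=True)

def fit_poly(x, y, degree):
    # least-squares coefficients for the polynomial model
    coeffs, *_ = np.linalg.lstsq(poly_design(x, degree), y, rcond=None)
    return coeffs

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def best_degree(x_train, y_train, x_val, y_val, degrees=(1, 2, 3)):
    # pick the degree (linear, quadratic or cubic) with the lowest validation mse
    return min(degrees, key=lambda d: mse(
        y_val, poly_design(x_val, d) @ fit_poly(x_train, y_train, d)))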
diff --git a/model-Solution.ipynb b/model-Solution.ipynb
index 943c73d631331e5d1aa8c6584e909a21c74a1149..ccdf6101e2a4027f83ac529c3f66d6e214faed01 100644
--- a/model-Solution.ipynb
+++ b/model-Solution.ipynb
@@ -74,7 +74,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -97,6 +97,36 @@
     "    def predict(self):\n",
     "        raise NotImplementedError"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def conf_matrix(true_l, pred, threshold):\n",
+    "    tp = tn = fp = fn = 0\n",
+    "    \n",
+    "    for i in range(len(true_l)):\n",
+    "        tmp = -1\n",
+    "        # predicted label is 1 if the ratio exceeds the threshold, else -1\n",
+    "        if pred[i] > threshold:\n",
+    "            tmp = 1\n",
+    "        if tmp == true_l[i]:\n",
+    "            \n",
+    "            if true_l[i] == 1:\n",
+    "                tp += 1\n",
+    "            else:\n",
+    "                tn += 1\n",
+    "        else:\n",
+    "            if true_l[i] == 1:\n",
+    "                fn += 1\n",
+    "            else:\n",
+    "                fp += 1\n",
+    "    \n",
+    "    # returns the confusion matrix as numpy.ndarray\n",
+    "    return np.array([tp, tn, fp, fn])"
+   ]
  }
 ],
 "metadata": {
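The loop in the model-Solution.ipynb cell above can also be written in vectorized form. This sketch assumes, as the solution's comparison tmp == true_l[i] implies, that true labels take the values -1 and 1 and that pred holds positive-label ratios:

import numpy as np

def conf_matrix_vectorized(true_l, pred, threshold):
    true_l = np.asarray(true_l)
    # threshold the ratios into hard -1/1 predictions
    hard = np.where(np.asarray(pred) > threshold, 1, -1)
    tp = np.sum((hard == 1) & (true_l == 1))
    tn = np.sum((hard == -1) & (true_l == -1))
    fp = np.sum((hard == 1) & (true_l == -1))
    fn = np.sum((hard == -1) & (true_l == 1))
    return np.array([tp, tn, fp, fn])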
diff --git a/model.ipynb b/model.ipynb
index 7ba1ea9653ea6209dbc9a2f7c20506e67130ab4a..fa80a4feb258f1a42087db329c474f7fb7ed3d73 100644
--- a/model.ipynb
+++ b/model.ipynb
@@ -71,6 +71,9 @@
     "    # np.random.choice might come in handy. Do not sample with replacement!\n",
     "    # Be sure to not use the same indices in test and validation sets!\n",
     "    \n",
+    "    # Use the first np.ceil(size*p) of the sampled indices for the test set\n",
+    "    # and the following np.ceil(size*v) for the validation set.\n",
+    "    \n",
     "    raise NotImplementedError\n",
     "    \n",
     "    # return two 1d arrays: one keeping validation set indices, the other keeping test set indices \n",
@@ -79,7 +82,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -103,6 +106,42 @@
     "    def predict(self, testpoint):\n",
     "        raise NotImplementedError"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## General supervised learning related functions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Implement the \"conf_matrix\" function that takes as input an array of true labels ($true$), an array of predicted labels ($pred$), and a $threshold$ on the positive-label ratio. It should output a numpy.ndarray."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def conf_matrix(true, pred, threshold=0.5):\n",
+    "    '''\n",
+    "    true: nx1 array of true labels for test set\n",
+    "    pred: nx1 array of predicted labels (or positive-label ratios) for test set\n",
+    "    threshold: ratio above which a prediction counts as positive\n",
+    "    '''\n",
+    "    raise NotImplementedError\n",
+    "    \n",
+    "    tp = tn = fp = fn = 0\n",
+    "    # calculate true positives (tp), true negatives (tn),\n",
+    "    # false positives (fp) and false negatives (fn)\n",
+    "    \n",
+    "    # returns the confusion matrix as numpy.ndarray\n",
+    "    return np.array([tp, tn, fp, fn])"
+   ]
  }
 ],
 "metadata": {
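Finally, a sketch of the "partition" logic described in the model.ipynb comments above: sample all indices without replacement, then slice off the test and validation portions. The exact signature and the seed argument (mirroring the 'seed' entry in kwargs) are assumptions, not the skeleton's actual API:

import numpy as np

def partition(size, p, v, seed=None):
    rng = np.random.RandomState(seed)
    # permutation via sampling without replacement, as the hint suggests
    indices = rng.choice(size, size=size, replace=False)
    n_test = int(np.ceil(size * p))
    n_val = int(np.ceil(size * v))
    test_indices = indices[:n_test]
    val_indices = indices[n_test:n_test + n_val]  # disjoint from the test set
    # return validation indices first, matching the skeleton's comment
    return val_indices, test_indices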