From 423f5c4cec8977a1026bfa01e449d698ff858165 Mon Sep 17 00:00:00 2001
From: Zeynep Hakguder <zhakguder@cse.unl.edu>
Date: Wed, 30 May 2018 19:04:02 -0500
Subject: [PATCH] PA1 finalized

---
 .../ProgrammingAssignment1.ipynb              | 32 +++++++++----------
 model.ipynb                                   | 10 +++---
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb b/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb
index eec35ad..e3a24da 100644
--- a/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb
+++ b/ProgrammingAssignment_1/ProgrammingAssignment1.ipynb
@@ -4,9 +4,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# $k$-Nearest Neighbor\n",
+    "# *k*-Nearest Neighbor\n",
     "\n",
-    "We'll implement $k$-Nearest Neighbor ($k$-NN) algorithm for this assignment. We recommend using [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n",
+    "We'll implement the *k*-Nearest Neighbor (*k*-NN) algorithm for this assignment. We recommend using the [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) dataset, although it is not mandatory. If you choose to use a different dataset, it should meet the following criteria:\n",
     "* dependent variable should be binary (suited for binary classification)\n",
     "* number of features (attributes) should be at least 50\n",
     "* number of examples (instances) should be between 1,000 and 5,000\n",
@@ -16,8 +16,8 @@
     "### Assignment Goals:\n",
     "In this assignment, we will:\n",
     "* learn to split a dataset into training/validation/test partitions \n",
-    "* use the validation dataset to find a good value for $k$\n",
-    "* Having found the \"best\" $k$, we'll obtain final performance measures:\n",
+    "* use the validation dataset to find a good value for *k*\n",
+    "* having found the \"best\" *k*, obtain final performance measures:\n",
     "    * accuracy, generalization error and ROC curve\n"
    ]
   },
@@ -43,7 +43,7 @@
     "\n",
     "|   | Tasks          | 478 | 878 |\n",
     "|---|----------------|-----|-----|\n",
-    "| 5 | Optimizing $k$ | 10  | 10  |\n",
+    "| 5 | Optimizing *k* | 10  | 10  |\n",
     "\n",
     "\n",
     "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There are a total of 100 points in this assignment and an extra 10 bonus points for 478 students."
@@ -93,7 +93,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Choice of distance metric plays an important role in the performance of $k$-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value."
+    "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function in **model.ipynb**. It should take two data points and the name of the metric and return a scalar value."
    ]
   },
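For reference, a distance function matching the description above might look like the sketch below. The supported metric names ("euclidean" and "manhattan") are illustrative assumptions; the assignment only requires two data points and a metric name in, a scalar out.

```python
import numpy as np

def distance(x, y, metric="euclidean"):
    # Sketch of the "distance" function: two points and a metric name in,
    # a scalar out. The metric names here are assumptions, not mandated.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if metric == "euclidean":
        return float(np.sqrt(np.sum((x - y) ** 2)))
    if metric == "manhattan":
        return float(np.sum(np.abs(x - y)))
    raise ValueError(f"unknown metric: {metric}")
```

Other metrics (e.g. cosine) can be added as extra branches without changing the interface.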
   {
@@ -135,7 +135,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can start implementing our $k$-NN classifier. $k$-NN class inherits Model class. You'll need to implement \"fit\" and \"predict\" methods. Use the \"distance\" function you defined above. \"fit\" method takes $k$ as an argument. \"predict\" takes as input an $mxd$ array containing $d$-dimensional $m$ feature vectors for examples and outputs the predicted class and the ratio of positive examples in $k$ nearest neighbors."
+    "We can start implementing our *k*-NN classifier. The *k*-NN class inherits from the Model class. You'll need to implement the \"fit\" and \"predict\" methods, using the \"distance\" function you defined above. The \"fit\" method takes *k* as an argument. \"predict\" takes as input an *m* x *d* array containing *m* *d*-dimensional feature vectors and outputs the predicted class and the ratio of positive examples among the *k* nearest neighbors."
    ]
   },
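A minimal sketch of such a classifier, assuming 0/1 labels and Euclidean distance, and omitting the Model base class defined in model.ipynb:

```python
import numpy as np

class KNN:
    # Sketch only: the real assignment inherits from the Model class in
    # model.ipynb and uses the "distance" function; both are omitted here.
    def fit(self, features, labels, k):
        # k-NN "training" just stores the data and the choice of k
        self.features = np.asarray(features, dtype=float)
        self.labels = np.asarray(labels)
        self.k = k

    def predict(self, test_points):
        preds, ratios = [], []
        for x in np.asarray(test_points, dtype=float):
            # Euclidean distances from x to every stored training example
            d = np.sqrt(((self.features - x) ** 2).sum(axis=1))
            nearest = self.labels[np.argsort(d)[: self.k]]
            ratio = float(np.mean(nearest == 1))  # fraction of positive neighbors
            preds.append(1 if ratio >= 0.5 else 0)
            ratios.append(ratio)
        return np.array(preds), np.array(ratios)
```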
   {
@@ -229,7 +229,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Assign a value to $k$ and fit the kNN model."
+    "Assign a value to *k* and fit the *k*-NN model."
    ]
   },
   {
@@ -269,8 +269,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Computing the confusion matrix for $k = 10$\n",
-    "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet."
+    "### Computing the confusion matrix for *k* = 10\n",
+    "Now that we have the true labels and the predicted ones from our model, we can build a confusion matrix and see how accurate our model is. Implement the \"conf_matrix\" function (in model.ipynb) that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray. You do not need to change the value of the threshold parameter yet."
    ]
   },
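One way the "conf_matrix" function could be sketched is shown below; the [[TN, FP], [FN, TP]] layout is an assumption, since the assignment only requires a numpy.ndarray:

```python
import numpy as np

def conf_matrix(true, pred):
    # 2x2 confusion matrix [[TN, FP], [FN, TP]] for 0/1 labels;
    # the row/column ordering here is an illustrative choice.
    true, pred = np.asarray(true), np.asarray(pred)
    tn = int(np.sum((true == 0) & (pred == 0)))
    fp = int(np.sum((true == 0) & (pred == 1)))
    fn = int(np.sum((true == 1) & (pred == 0)))
    tp = int(np.sum((true == 1) & (pred == 1)))
    return np.array([[tn, fp], [fn, tp]])
```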
   {
@@ -329,7 +329,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## TASK 5: Determining $k$ "
+    "## TASK 5: Determining *k*"
    ]
   },
   {
@@ -337,7 +337,7 @@
    "metadata": {},
    "source": [
     "### Rubric:\n",
-    "* Increased accuracy with new $k$ +5, +5\n",
+    "* Increased accuracy with new *k* +5, +5\n",
     "* Improved confusion matrix +5, +5"
    ]
   },
@@ -345,9 +345,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can use the validation set to come up with a $k$ value that results in better performance in terms of accuracy.\n",
+    "We can use the validation set to come up with a *k* value that results in better performance in terms of accuracy.\n",
     "\n",
-    "Below calculate the accuracies for different values of $k$ using the validation set. Report a good $k$ value and use it in the analyses that follow this section. Report confusion matrix for the new value of $k$."
+    "Below, calculate the accuracies for different values of *k* using the validation set. Report a good *k* value and use it in the analyses that follow this section. Report the confusion matrix for the new value of *k*."
    ]
   },
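The selection loop described above can be sketched as follows; the candidate *k* values and the `knn_predict` helper (a bare majority-vote predictor standing in for the class above) are hypothetical:

```python
import numpy as np

def knn_predict(train_X, train_y, X, k):
    # Helper: majority vote among the k nearest (Euclidean) training points
    preds = []
    for x in np.asarray(X, dtype=float):
        d = np.sqrt(((train_X - x) ** 2).sum(axis=1))
        nearest = train_y[np.argsort(d)[:k]]
        preds.append(1 if np.mean(nearest == 1) >= 0.5 else 0)
    return np.array(preds)

def best_k(train_X, train_y, val_X, val_y, candidates=(1, 3, 5, 10, 25)):
    # Evaluate each candidate k on the validation set and keep the most accurate
    accs = {k: float(np.mean(knn_predict(train_X, train_y, val_X, k) == val_y))
            for k in candidates}
    return max(accs, key=accs.get), accs
```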
   {
@@ -377,14 +377,14 @@
    "metadata": {},
    "source": [
     "### ROC curve and confusion matrix for the final model\n",
-    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different $threshold$ values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different $threshold$ parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities."
+    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now, implement, in \"model.ipynb\", a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list of different *threshold* parameter values to try and returns two arrays: one where each entry is the sensitivity at a given threshold and one where each entry is the corresponding 1-specificity."
    ]
   },
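A sketch of such a "ROC" function is below. For self-containment it takes the true labels and the positive-neighbor ratios from "predict" as explicit arguments, which is an assumption about the interface:

```python
import numpy as np

def roc(true, pos_ratios, thresholds):
    # For each threshold, call an example positive when its positive-neighbor
    # ratio meets the threshold, then record sensitivity and 1 - specificity.
    true = np.asarray(true)
    sens, one_minus_spec = [], []
    for t in thresholds:
        pred = (np.asarray(pos_ratios) >= t).astype(int)
        tp = np.sum((true == 1) & (pred == 1))
        fn = np.sum((true == 1) & (pred == 0))
        tn = np.sum((true == 0) & (pred == 0))
        fp = np.sum((true == 0) & (pred == 1))
        sens.append(tp / (tp + fn))
        one_minus_spec.append(fp / (fp + tn))
    return np.array(sens), np.array(one_minus_spec)
```

Plotting the second array against the first (e.g. with matplotlib) gives the ROC curve.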
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can finally create the confusion matrix and plot the ROC curve for our optimal $k$-NN classifier. Use the $k$ value you found above, if you completed TASK 5, else use $k$ = 10. We'll plot the ROC curve for values between 0.1 and 1.0."
+    "We can finally create the confusion matrix and plot the ROC curve for our optimal *k*-NN classifier. Use the *k* value you found above if you completed TASK 5; otherwise, use *k* = 10. We'll plot the ROC curve for threshold values between 0.1 and 1.0."
    ]
   },
   {
diff --git a/model.ipynb b/model.ipynb
index 0d6d853..10fe021 100644
--- a/model.ipynb
+++ b/model.ipynb
@@ -25,7 +25,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This step is for reading the dataset and for extracting features and labels. The \"preprocess\" function should return an $n \\times d$ \"features\" array, and an $n \\times 1$ \"labels\" array, where $n$ is the number of examples and $d$ is the number of features in the dataset. In cases where there is a big difference between the scales of features, we want to normalize the features to have values in the same range [0,1]. Since this is not the case with this dataset, we will not do normalization."
+    "This step reads the dataset and extracts features and labels. The \"preprocess\" function should return an *n x d* \"features\" array and an *n x 1* \"labels\" array, where *n* is the number of examples and *d* is the number of features in the dataset. In cases where there is a big difference between the scales of features, we want to normalize the features to the same range, [0, 1]. Since this is not the case with this dataset, we will not do normalization."
    ]
   },
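If normalization were needed, a min-max scaling helper could look like the sketch below (not required for this dataset, per the text above):

```python
import numpy as np

def minmax_normalize(features):
    # Scale each column to [0, 1]; useful only when feature scales differ widely
    features = np.asarray(features, dtype=float)
    mins = features.min(axis=0)
    spans = features.max(axis=0) - mins
    spans[spans == 0] = 1.0  # avoid division by zero on constant columns
    return (features - mins) / spans
```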
   {
@@ -63,7 +63,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion $t$ of the dataset as test partition and a proportion of $v$ as validation partition. The remaining will be used as training data. For example, to keep 30% of the examples as test and %10 as validation, set $t=0.3$ and $v=0.1$. You should choose these values according to the size of the data available to you. The \"split\" function should return indices of the training, validation and test sets. These will be used to index into the whole training set."
+    "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion *t* of the dataset as the test partition and a proportion *v* as the validation partition. The remainder will be used as training data. For example, to keep 30% of the examples as test and 10% as validation, set *t* = 0.3 and *v* = 0.1. You should choose these values according to the size of the data available to you. The \"split\" function should return indices of the training, validation and test sets. These will be used to index into the whole training set."
    ]
   },
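The partitioning step could be sketched as follows, returning index arrays as described (the rounding of the partition sizes is an implementation assumption):

```python
import numpy as np

def partition(size, t, v):
    # Shuffle indices 0..size-1, then carve off the test and validation
    # proportions; whatever remains is the training partition.
    idx = np.random.permutation(size)
    n_test = int(size * t)
    n_val = int(size * v)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test
```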
   {
@@ -150,7 +150,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Choice of distance metric plays an important role in the performance of $k$-NN. Let's start with implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value."
+    "Choice of distance metric plays an important role in the performance of *k*-NN. Let's start with implementing a distance method in the \"distance\" function below. It should take two data points and the name of the metric and return a scalar value."
    ]
   },
   {
@@ -233,7 +233,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Implement the \"conf_matrix\" function that takes as input an array of true labels ($true$) and an array of predicted labels ($pred$). It should output a numpy.ndarray."
+    "Implement the \"conf_matrix\" function that takes as input an array of true labels (*true*) and an array of predicted labels (*pred*). It should output a numpy.ndarray."
    ]
   },
   {
@@ -268,7 +268,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut off points. Now, implement a \"ROC\" function that predicts the labels of the test set examples using different $threshold$ values in \"predict\" and plot the ROC curve. \"ROC\" takes a list containing different $threshold$ parameter values to try and returns two arrays; one where each entry is the sensitivity at a given threshold and the other where entries are 1-specificities."
+    "ROC curves are a good way to visualize sensitivity vs. 1-specificity for varying cut-off points. Now, implement a \"ROC\" function that predicts the labels of the test set examples using different *threshold* values in \"predict\" and plot the ROC curve. \"ROC\" takes a list of different *threshold* parameter values to try and returns two arrays: one where each entry is the sensitivity at a given threshold and one where each entry is the corresponding 1-specificity."
    ]
   },
   {
-- 
GitLab