diff --git a/ProgrammingAssignment_0/GettingFamiliar.ipynb b/ProgrammingAssignment_0/GettingFamiliar.ipynb index 637804ef7d026ad8a552cefc5d4ec346bd9cac9c..999b7b79fa5b7b35e685a628e75a68b83de22b1c 100644 --- a/ProgrammingAssignment_0/GettingFamiliar.ipynb +++ b/ProgrammingAssignment_0/GettingFamiliar.ipynb @@ -32,7 +32,7 @@ "\n", "| | Tasks | 478 | 878 |\n", "|---|---------------------------------------|-----|-----|\n", - "|4 | Modify `preprocess` for normalization | 5 | 10 |\n", + "|4 | Implement `normalization` | 5 | 10 |\n", "\n", "\n", - "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 25 points in this assignment and extra 5 bonus points for 478 students." + "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There are a total of 25 points in this assignment and an extra 5 bonus points for 478 students." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning Model Skeleton\n", "\n", - "We'll use this skeleton for implementing different supervised learning algorithms. For this first assignment, we'll read and partition the [\"madelon\" dataset](http://archive.ics.uci.edu/ml/datasets/madelon). Features and labels for the first two examples are listed below. Please complete \"preprocess\" and \"partition\" methods. " + "We'll use this skeleton for implementing different supervised learning algorithms. For this first assignment, we'll read and partition the [\"madelon\" dataset](http://archive.ics.uci.edu/ml/datasets/madelon). Features and labels for the first two examples are listed below. Please complete the \"preprocess\" and \"partition\" functions. " ] }, { @@ -180,7 +180,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion $t$ of the dataset as test partition and a proportion of $v$ as validation partition. The remaining will be used as training data. For example, to keep 30% of the examples as test and %10 as validation, set $t=0.3$ and $v=0.1$. You should choose these values according to the size of the data available to you. The \"split\" function should return indices of the training, validation and test sets. These will be used to index into the whole training set." + "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion $t$ of the dataset indices for the test partition and a proportion $v$ for the validation partition. The remaining indices will be used for the training data. For example, to keep 30% of the examples as test and 10% as validation, set $t=0.3$ and $v=0.1$. You should choose these values according to the size of the data available to you. The \"partition\" function should return indices of the training, validation and test sets. These will be used to index into the whole dataset." ] }, { @@ -203,6 +203,8 @@ " 1D array containing test set indices\n", " val_indices: ndarray\n", " 1D array containing validation set indices\n", + " train_indices: ndarray\n", + " 1D array containing training set indices\n", " '''\n", " \n", " # np.random.permutation might come in handy. 
Do not sample with replacement!\n", @@ -215,7 +217,7 @@ " \n", " raise NotImplementedError\n", " \n", - " return test_indices, val_indices" + " return test_indices, val_indices, train_indices" ] }, { @@ -263,35 +265,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class Model:\n", - " # preprocess_f and partition_f expect functions\n", - " # use kwargs to pass arguments to preprocessor_f and partition_f\n", - " # kwargs is a dictionary and should contain t, v, feature_file, label_file\n", - " # e.g. {'t': 0.3, 'v': 0.1, 'feature_file': 'some_file_name', 'label_file': 'some_file_name'}\n", - " \n", - " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", - " \n", - " self.features, self.labels = preprocessor_f(kwargs['feature_file'], kwargs['label_file'])\n", - " self.size = len(self.labels) # number of examples in dataset \n", - " self.feat_dim = self.features.shape[1] # number of features\n", - " \n", - " self.val_indices, self.test_indices = partition_f(self.size, kwargs['t'], kwargs['v'])\n", - " self.val_size = len(self.val_indices)\n", - " self.test_size = len(self.test_indices)\n", - " \n", - " self.train_indices = np.delete(np.arange(self.size), np.append(self.test_indices, self.val_indices), 0)\n", - " self.train_size = len(self.train_indices)\n", " \n", - " def fit(self):\n", - " \n", - " raise NotImplementedError\n", + " def fit(self, training_features, training_labels):\n", + " print('There are {} data points in the training partition with {} features.'.format(\n", + " training_features.shape[0], training_features.shape[1]))\n", + " return\n", " \n", - " def predict(self, indices):\n", - " raise NotImplementedError" + " def predict(self, test_points):\n", + " return" ] }, { @@ -313,20 +299,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will use a keyword arguments dictionary that conveniently passes arguments to functions that are themselves passed as arguments during object initialization. Please do not change these calls in this and the following assignments." + "Initialize the model and call the fit method with the training features and labels." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# TODO\n", - "# pass the correct arguments to preprocessor_f and partition_f\n", - "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n", - "my_model = Model(preprocessor_f=..., partition_f=..., **kwargs)\n", - "# Output size of the training partition" + "\n", + "# initialize model\n", + "my_model = Model()\n", + "# obtain features and labels from files\n", + "# partition the data set\n", + "# pass the training features and labels to the fit method" ] }, { @@ -340,7 +328,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Modify `preprocess` function such that the output features take values in the range [0, 1]. Initialize a new model with this function and check the values of the features." + "Implement the `normalization` function such that the output features take values in the range [0, 1]. Check that the values of the features are in [0, 1]." 
] }, { @@ -360,15 +348,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# TODO\n", - "# args is a placeholder for the parameters of the function\n", - "# Args and Returns are as in \"preprocess\"\n", - "def normalized_preprocess(args=...):\n", - " raise NotImplementedError" + "def normalization(raw_features):\n", + " '''\n", + " Args:\n", + " raw_features: ndarray\n", + " nxd array containing unnormalized features\n", + " Returns:\n", + " features: ndarray\n", + " nxd array containing normalized features\n", + " \n", + " '''\n", + " raise NotImplementedError\n", + " \n", + " return features" ] }, { @@ -379,8 +376,7 @@ "source": [ "# TODO\n", "\n", - "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': ..., 'label_file': ...}\n", - "my_model = Model(preprocessor_f=..., partition_f=..., **kwargs)\n", + "features = normalization(features)\n", "\n", "# Check that the range of each feature in the training set is in range [0, 1]" ] diff --git a/ProgrammingAssignment_0/GettingFamiliar_solution.ipynb b/ProgrammingAssignment_0/GettingFamiliar_solution.ipynb index 73b654683801b1f268cdb6f6c3578f9f941d107f..25d9397e2fea0e0b2630973e837ad802c098f6b6 100644 --- a/ProgrammingAssignment_0/GettingFamiliar_solution.ipynb +++ b/ProgrammingAssignment_0/GettingFamiliar_solution.ipynb @@ -32,7 +32,7 @@ "\n", "| | Tasks | 478 | 878 |\n", "|---|---------------------------------------|-----|-----|\n", - "|4 | Modify `preprocess` for normalization | 5 | 10 |\n", + "|4 | Implement `normalization` | 5 | 10 |\n", "\n", "\n", - "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There a total of 25 points in this assignment and extra 5 bonus points for 478 students." + "Points are broken down further below in Rubric sections. The **first** score is for 478, the **second** is for 878 students. There are a total of 25 points in this assignment and an extra 5 bonus points for 478 students." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning Model Skeleton\n", "\n", - "We'll use this skeleton for implementing different supervised learning algorithms. For this first assignment, we'll read and partition the [\"madelon\" dataset](http://archive.ics.uci.edu/ml/datasets/madelon). Features and labels for the first two examples are listed below. Please complete \"preprocess\" and \"partition\" methods. " + "We'll use this skeleton for implementing different supervised learning algorithms. For this first assignment, we'll read and partition the [\"madelon\" dataset](http://archive.ics.uci.edu/ml/datasets/madelon). Features and labels for the first two examples are listed below. Please complete the \"preprocess\" and \"partition\" functions. " ] }, { @@ -56,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -132,7 +132,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -179,7 +179,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -209,12 +209,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion $t$ of the dataset as test partition and a proportion of $v$ as validation partition. The remaining will be used as training data. For example, to keep 30% of the examples as test and %10 as validation, set $t=0.3$ and $v=0.1$. You should choose these values according to the size of the data available to you. 
The \"split\" function should return indices of the training, validation and test sets. These will be used to index into the whole training set." + "Next, you'll need to split your dataset into training, validation and test sets. The \"partition\" function should take as input the size of the whole dataset and randomly sample a proportion $t$ of the dataset indices for test partition and a proportion of $v$ for validation partition. The remaining will be used as indices for training data. For example, to keep 30% of the examples as test and %10 as validation, set $t=0.3$ and $v=0.1$. You should choose these values according to the size of the data available to you. The \"split\" function should return indices of the training, validation and test sets. These will be used to index into the whole training set." ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -253,9 +253,9 @@ " test_indices = permuted[:t_size]\n", " # and the next v_size for validation\n", " val_indices = permuted[t_size+1:t_size+v_size+1]\n", + " train_indices = np.delete(np.arange(size), np.append(test_indices, val_indices), 0)\n", " \n", - " \n", - " return test_indices, val_indices" + " return test_indices, val_indices, train_indices" ] }, { @@ -276,24 +276,24 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Test size: 600, validation size: 200\n" + "Test size: 600, validation size: 200, training size: 1200\n" ] } ], "source": [ "# TODO\n", "# Pass the correct size argument (number of examples in the whole dataset)\n", - "test_indices, val_indices = partition(size = features.shape[0], t = 0.3, v = 0.1)\n", + "test_indices, val_indices, train_indices = partition(size = features.shape[0], t = 0.3, v = 0.1)\n", "\n", - "# Output the length of both features and labels.\n", - "print('Test size: {}, validation size: {}'.format(test_indices.shape[0], val_indices.shape[0]))" + "# Output the length of both test and validation indices.\n", + "print('Test size: {}, validation size: {}, training size: {}'.format(test_indices.shape[0], val_indices.shape[0], train_indices.shape[0]))" ] }, { @@ -312,34 +312,18 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "class Model:\n", - " # preprocess_f and partition_f expect functions\n", - " # use kwargs to pass arguments to preprocessor_f and partition_f\n", - " # kwargs is a dictionary and should contain t, v, feature_file, label_file\n", - " # e.g. 
{'t': 0.3, 'v': 0.1, 'feature_file': 'some_file_name', 'label_file': 'some_file_name'}\n", - " \n", - " def __init__(self, preprocessor_f, partition_f, **kwargs):\n", - " \n", - " self.features, self.labels = preprocessor_f(kwargs['feature_file'], kwargs['label_file'])\n", - " self.size = len(self.labels) # number of examples in dataset \n", - " self.feat_dim = self.features.shape[1] # number of features\n", - " \n", - " self.val_indices, self.test_indices = partition_f(self.size, kwargs['t'], kwargs['v'])\n", - " self.val_size = len(self.val_indices)\n", - " self.test_size = len(self.test_indices)\n", " \n", - " self.train_indices = np.delete(np.arange(self.size), np.append(self.test_indices, self.val_indices), 0)\n", - " self.train_size = len(self.train_indices)\n", - " \n", - " def fit(self):\n", - " \n", - " raise NotImplementedError\n", + " def fit(self, training_features, training_labels):\n", + " print('There are {} data points in the training partition with {} features.'.format(\n", + " training_features.shape[0], training_features.shape[1]))\n", + " return\n", " \n", - " def predict(self, indices):\n", + " def predict(self, test_points):\n", " raise NotImplementedError" ] }, @@ -362,29 +346,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will use a keyword arguments dictionary that conveniently passes arguments to functions that are themselves passed as arguments during object initialization. Please do not change these calls in this and the following assignments." + "Initialize the model and call the fit method with the training features and labels." ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "There are 1200 data points in training partition.\n" + "There are 1200 data points in the training partition with 500 features.\n" ] } ], "source": [ - "# TODO\n", - "# pass the correct arguments to preprocessor_f and partition_f\n", - "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': '../data/madelon.data', 'label_file': '../data/madelon.labels'}\n", - "my_model = Model(preprocessor_f=preprocess, partition_f=partition, **kwargs)\n", - "# Output size of the training partition\n", - "print('There are {} data points in training partition.'.format(my_model.train_size))" + "my_model = Model()\n", + "# obtain features and labels from files\n", + "features, labels = preprocess('../data/madelon.data', '../data/madelon.labels')\n", + "# partition the data set\n", + "test_indices, val_indices, train_indices = partition(features.shape[0], 0.3, 0.1)\n", + "# pass the training features and labels to the fit method\n", + "my_model.fit(features[train_indices], labels[train_indices])" ] }, { @@ -398,7 +383,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Modify `preprocess` function such that the output features take values in the range [0, 1]. Initialize a new model with this function and check the values of the features." + "Implement the `normalization` function such that the output features take values in the range [0, 1]. Check that the values of the features are in [0, 1]." 
] }, { @@ -418,18 +403,15 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# TODO\n", - "# args is a placeholder for the parameters of the function\n", - "# Args and Returns are as in \"preprocess\"\n", + "# raw_features: nxd ndarray of unnormalized features\n", + "# returns an nxd ndarray of features normalized to [0, 1]\n", - "def normalized_preprocess(feature_file, label_file):\n", - " \n", - " # read in features\n", - " raw_features = np.genfromtxt(feature_file)\n", - " \n", + "def normalization(raw_features):\n", + " \n", " # initialize an empty ndarray with the shape of raw_features\n", " dims = raw_features.shape\n", " features = np.empty(dims)\n", @@ -440,33 +422,29 @@ " max_val = max(col_values)\n", " features[:, col] = col_values/max_val \n", " \n", - " # read in labels\n", - " labels = np.genfromtxt(label_file)\n", - " \n", - " return features.T, labels" + " return features" ] }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Max value: 1.0, min value: 0.0\n" + "Min value: 0.0, max value: 1.0\n" ] } ], "source": [ "# TODO\n", "\n", - "kwargs = {'t': 0.3, 'v': 0.1, 'feature_file': '../data/madelon.data', 'label_file': '../data/madelon.labels'}\n", - "my_model = Model(preprocessor_f=normalized_preprocess, partition_f=partition, **kwargs)\n", + "features = normalization(features)\n", "# Check that the range of each feature in the training set is in range [0, 1]\n", "\n", - "print('Min value: {}, max value: {}'.format(features.min(), features.max()))" ] } ],
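
Reviewer note on the `partition` hunks: the solution's slicing and `np.delete` construction can be hard to follow in diff form. Below is a minimal sketch of how the completed function might look end to end, assuming the skeleton's `partition(size, t, v)` signature; the `int(size * t)` rounding is an assumption, since the hunk does not show how `t_size` and `v_size` are computed.

```python
import numpy as np

def partition(size, t, v):
    # Shuffling all indices once and slicing the permutation samples
    # without replacement, as the skeleton's comment requires.
    permuted = np.random.permutation(size)
    t_size = int(size * t)  # assumed rounding; not shown in the hunk
    v_size = int(size * v)
    test_indices = permuted[:t_size]                # first t_size indices
    val_indices = permuted[t_size:t_size + v_size]  # next v_size indices
    train_indices = permuted[t_size + v_size:]      # everything left over
    return test_indices, val_indices, train_indices
```

For `size=2000`, `t=0.3`, `v=0.1` this reproduces the 600/200/1200 split printed in the solution; taking the tail of the permutation yields the same index set as the solution's `np.delete(np.arange(size), np.append(test_indices, val_indices), 0)`, only in shuffled rather than sorted order.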
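Similarly for `normalization`: the solution divides each column by its maximum, which lands in [0, 1] only because madelon features are nonnegative. A hedged alternative sketch using per-column min-max scaling, which meets the [0, 1] requirement for arbitrary inputs; the guard against constant columns is an addition not present in the solution.

```python
import numpy as np

def normalization(raw_features):
    # Per-column min-max scaling: (x - min) / (max - min) maps each
    # feature into [0, 1] regardless of sign or scale.
    col_min = raw_features.min(axis=0)
    col_max = raw_features.max(axis=0)
    # Constant columns would give a zero denominator; scale them by 1
    # instead, so they map to 0 after the shift.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (raw_features - col_min) / span
```

Called as in the notebook, `features = normalization(features)`, the final check should report a minimum of 0.0 and a maximum of 1.0, matching the solution's output.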