## Assignment 3: Neural Nets and PyTorch

### Due date: Wednesday March 17th, 11:59pm

*(Image from Wikipedia)*

Created by Justin Lizama, Kedan Li, and Tiantian Fang

Updated fall 2020 by Jatin Arora, Kedan Li, and Michal Shlapentokh-Rothman

Updated spring 2021 by Mahir Morshed and Yangge Li

The goal of this assignment is to employ neural networks, nonlinear and multi-layer extensions of the linear perceptron, to detect whether or not images contain animals.

In the first part, you will create a 1980s-style shallow neural network. In the second part, you will improve this network using more modern techniques, such as a different activation function, a different network architecture, or different initialization details.

You will be using the PyTorch and NumPy libraries to implement these models. The PyTorch library will do most of the heavy lifting for you, but it is still up to you to implement the right high-level instructions to train the model.

## Dataset

The dataset consists of 10000 32x32 colored images (a subset of the CIFAR-10 dataset, provided by Alex Krizhevsky), split for you into 7500 training examples (of which 2999 are negative and 4501 are positive) and 2500 development examples.

The data set can be downloaded here: (gzip) or (zip). When you uncompress this you'll find a binary object that our reader code will unpack for you.

## Part 1: Classical Shallow Network

The basic neural network model consists of a sequence of hidden layers sandwiched between an input layer and an output layer. Data is fed in at the input layer, passed through the hidden layers, and emitted at the output layer. Every neural network induces a function $$F_{W}$$, given by propagating the data through the layers.

To make things more precise, in lecture you learned of a function $$f_{w}(x) = \sum_{i=1}^n w_i x_i + b$$. In this assignment, given weight matrices $$W_1 \in \mathbb{R}^{h \times d}$$ and $$W_2 \in \mathbb{R}^{2 \times h}$$ and bias vectors $$b_1 \in \mathbb{R}^{h}$$ and $$b_2 \in \mathbb{R}^{2}$$, you will learn a function $$F_{W}$$ defined as $F_{W} (x) = W_2\sigma(W_1 x + b_1) + b_2$ where $$\sigma$$ is your activation function. In part 1, you should use either the sigmoid or the ReLU activation function. You will use 32 hidden units ($$h=32$$) and 3072 input units, one for each channel of each pixel in an image ($$d=(32)^2(3) = 3072$$).
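As an informal sketch (not the required skeleton code), the part 1 architecture could be expressed with PyTorch's `Linear` and `Sequential` objects as follows; the variable names here are illustrative:

```python
import torch
import torch.nn as nn

# Sizes from the spec: d = 32*32*3 input units, h = 32 hidden units.
d, h = 3072, 32

model = nn.Sequential(
    nn.Linear(d, h),   # computes W1 x + b1, with W1 of shape (h, d)
    nn.ReLU(),         # sigma; nn.Sigmoid() is the other allowed choice
    nn.Linear(h, 2),   # computes W2 sigma(...) + b2, giving two class scores
)

x = torch.randn(1, d)   # one fake flattened image, just for illustration
scores = model(x)       # F_W(x), shape (1, 2)
```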

### Training and Development

• Training: To train the neural network you are going to need to minimize the empirical risk $$\mathcal{R}(W)$$, which is defined as the mean loss determined by some loss function. For this assignment you can use cross entropy as that loss function. In the case of binary classification, the empirical risk is given by $\mathcal{R}(W) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat y_i + (1-y_i) \log (1 - \hat y_i) \right],$ where $$y_i$$ are the labels and $$\hat y_i = \sigma(F_{W}(x_i))$$, with $$\sigma(x) = \frac{1}{1+e^{-x}}$$ the sigmoid function. For this assignment, you won't have to implement these functions yourself; you can use the built-in PyTorch functions.

Notice that because PyTorch's CrossEntropyLoss already combines a softmax with the negative log-likelihood loss, you do not need to explicitly include an activation function in the last layer of your network.

• Development: After you have trained your neural network model, you will have your model decide whether or not images in the development set contain animals in them. This is done by evaluating your network $$F_{W}$$ on each example in the development set, and then taking the index of the maximum of the two outputs (argmax).
• Data Standardization: Convergence speed and accuracy can be improved greatly by simply standardizing your input data: subtract the sample mean and divide by the sample standard deviation. More precisely, you can alter your data matrix $$X$$ by setting $$X:=(X-\mu)/\sigma$$.
With the aforementioned model design and tips, you should expect around 0.84 dev-set accuracy.
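The standardization and dev-set evaluation described above can be sketched as follows. Here `train_x`, `dev_x`, and `model` are hypothetical stand-ins for the real data arrays and your trained network:

```python
import torch

# Stand-ins for the real data (7500 training, 2500 dev examples).
train_x = torch.randn(7500, 3072)
dev_x = torch.randn(2500, 3072)

# X := (X - mu) / sigma, using the *training* statistics for both sets.
mu, sigma = train_x.mean(), train_x.std()
train_x = (train_x - mu) / sigma
dev_x = (dev_x - mu) / sigma

# Dev predictions: argmax over the two outputs of F_W.
model = torch.nn.Linear(3072, 2)   # placeholder for your trained network
with torch.no_grad():
    preds = model(dev_x).argmax(dim=1)   # index 1 = "contains an animal"
```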

## Part 2: Modern Network

In this part, you will try to improve your performance by employing more modern machine learning techniques. These include, but are not limited to, the following:
1. Choice of activation function: Some possible candidates include Tanh, ELU, softplus, and LeakyReLU. You may find that choosing the right activation function will lead to significantly faster convergence, improved performance overall, or even both.
2. L2 Regularization: Regularization refers to techniques that improve your model's ability to generalize to unseen examples. One commonly used form is L2 regularization. Let $$\mathcal{R}(W)$$ be the empirical risk (mean loss). You can implement L2 regularization by adding an additional term that penalizes the norm of the weights. More precisely, your new empirical risk becomes $\mathcal{R}(W):= \mathcal{R}(W) + \lambda \sum_{W \in P} \Vert W \Vert_2 ^2$ where $$P$$ is the set of all your parameters and $$\lambda$$ (usually small) is a hyperparameter you choose. There are several other techniques besides L2 regularization for improving the generalization of your model, such as dropout or batch normalization.
3. Network Depth and Width: The sort of network you implemented in part 1 is a two-layer network because it uses two weight matrices. Sometimes it helps performance to add more hidden units or add more weight matrices to obtain greater representation power and make training easier.
4. Using Convolutional Neural Networks: While it is possible to obtain nice results with traditional multilayer perceptrons, when doing image classification tasks it is often best to use convolutional neural networks, which are tailored specifically to signal processing tasks such as image recognition. See if you can improve your results using convolutional layers in your network.
Try to employ some of these techniques in order to attain approximately 0.87 dev-set accuracy. The only stipulation is that you use fewer than 500,000 total parameters: counting every floating-point value in all of your weights, including bias terms, you may use at most 500,000 values.
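To check the parameter budget, you can sum `p.numel()` over your network's parameters. Below is a sketch using a made-up small convolutional network (not a recommended architecture) to show that it stays under the limit:

```python
import torch.nn as nn

# An illustrative small CNN for 32x32x3 inputs; the layer sizes here are
# arbitrary, chosen only to demonstrate the parameter count.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Counts every floating-point value in all weights and bias terms.
n_params = sum(p.numel() for p in net.parameters())
print(n_params)   # must stay under 500,000
```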

### Some things to look for:

1. The autograder runs the training process for 500 batches (max_iter=500). This is done so that we have a consistent training process for each evaluation and comparison with benchmarks/threshold accuracies.
2. You still have one thing in your full control, however: the learning rate. If you are confident about a model you implemented but are not able to pass the accuracy thresholds on Gradescope, you can try increasing the learning rate. It is certainly possible that your model could do better with more training. Be mindful, however, that a very high learning rate might also deteriorate performance, since the model may begin to oscillate around the optimum.
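With an optimizer object, the learning rate is just a constructor argument. The values below are illustrative; `weight_decay` is PyTorch's built-in way of adding the L2 penalty described in part 2:

```python
import torch

model = torch.nn.Linear(3072, 2)   # placeholder for your network
# lr controls the step size; weight_decay adds an L2 penalty on the weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```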

## Provided Code Skeleton

We have provided (tar/zip) all the code to get you started on your MP, which means you will only have to implement the PyTorch neural network model.

• reader.py - This file is responsible for reading in the data set. It creates a giant NumPy array of feature vectors corresponding to each image.
• mp3.py - This is the main file that starts the program, and computes the accuracy, precision, recall, and F1-score using your implementation.
• neuralnet_part1.py and neuralnet_part2.py - These are the files where you will be doing all of your work. You are given a NeuralNet class which implements a torch.nn.Module. This class provides __init__(), forward(), step(), and several other functions. (Beyond the important details below, more on what each of these methods in the NeuralNet class should do is given in the skeleton code.)
• __init__() is where you will need to construct the network architecture. There are multiple ways to do this.
• One way is to use the Linear and Sequential objects. Keep in mind that Linear initializes the weight matrices with a Kaiming He uniform initialization and draws the bias terms from a small uniform distribution.
• Another way is to explicitly define weight matrices W1, W2, ... and bias terms b1, b2, ... as Tensors. This approach is more hands-on and allows you to choose your own initialization. For this assignment, however, Kaiming He uniform initialization should suffice and is a good choice.
Additionally, you can initialize an optimizer object in this function to use to optimize your network in the step() function.
• get_parameters() should do what its name suggests--namely, return a list of parameters used in the model. (This and set_parameters() will only be tested with respect to part 1, but you may find implementing and using these helpful for part 2.)
• set_parameters() should do what its name suggests--namely, set the parameters of the model based on those input to this method. For consistency's sake, the order of the parameters should be the same as those returned in get_parameters(). (This and get_parameters() will only be tested with respect to part 1, but you may find implementing and using these helpful for part 2.)
• forward() should perform a forward pass through your network. This means it should explicitly evaluate $$F_{W}(x)$$ . This can be done by simply calling your Sequential object defined in __init__() or (if you opted to define tensors explicitly) by multiplying through the weight matrices with your data.
• step() should perform one iteration of training. This means it should perform one gradient update through one batch of training data (not the entire set of training data). You can do this either by calling loss_fn(yhat,y).backward() and then updating the weights directly yourself, or by using an optimizer object that you may have initialized in __init__() to update the network. Be sure to call zero_grad() on your optimizer in order to clear the gradient buffer. When you return the loss value from this function, return either loss_value.item() (which works if it is just a single number) or loss_value.detach().cpu().numpy(). The latter separates the loss value from the computations that led up to it, moves it to the CPU (important if you decide to work locally on a GPU, bearing in mind that Gradescope is not configured with a GPU), and then converts it to a NumPy array. This allows proper garbage collection to take place, lest your program exceed the memory limits fixed on Gradescope.
• fit() takes as input the training data, training labels, development set, and the maximum number of iterations. The training data provided is the output from reader.py. The training labels are a Tensor consisting of labels corresponding to each image in the training data. The development set is the Tensor of images that you are going to test your implementation on. The maximum number of iterations is the number you specified with --max_iter (it is 500 by default). fit() outputs the predicted labels. It should construct a NeuralNet object, and iteratively call the neural net's step() to train the network. This should be done by feeding in batches of data determined by batch size. You will use a batch size of 100 for this assignment. max_iter is the number of batches (not the number of epochs!) in your training process.
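Putting these pieces together, a rough sketch of a NeuralNet with a step() method and a fit()-style batching loop might look like this. Names and signatures follow the skeleton only loosely and may differ from the actual starter code:

```python
import torch
import torch.nn as nn

class NeuralNet(nn.Module):
    def __init__(self, lrate, in_size, out_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_size, 32), nn.ReLU(),
                                 nn.Linear(32, out_size))
        self.loss_fn = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=lrate)

    def forward(self, x):
        return self.net(x)

    def step(self, x, y):
        self.optimizer.zero_grad()               # clear the gradient buffer
        loss = self.loss_fn(self.forward(x), y)  # one batch's loss
        loss.backward()                          # backpropagate
        self.optimizer.step()                    # one gradient update
        return loss.item()                       # plain float, safe to return

# fit()-style loop: max_iter counts batches, not epochs.
train_x = torch.randn(500, 3072)             # stand-in for the real data
train_y = torch.randint(0, 2, (500,))
net = NeuralNet(0.01, 3072, 2)
batch_size, max_iter = 100, 5
for i in range(max_iter):
    start = (i * batch_size) % len(train_x)
    loss = net.step(train_x[start:start + batch_size],
                    train_y[start:start + batch_size])
```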

The only files you will need to modify are neuralnet_part1.py and neuralnet_part2.py.

You should definitely use the PyTorch documentation, linked multiple times on this page, to help you with implementation details. You can also use this PyTorch Tutorial as a reference to help you with your implementation. There are also other guides out there such as this one.

## Deliverables

This MP will be submitted via Gradescope; please upload neuralnet_part1.py (for part 1) and neuralnet_part2.py (for part 2).

## Extra credit: CIFAR-100 superclasses

For an extra 10% worth of the points on this MP, your task will be to pick any two superclasses from the CIFAR-100 dataset (described in the same place as CIFAR-10) and rework your neural net from part 2, if necessary, to distinguish between those two superclasses. A superclass contains 2500 training images and 500 testing images, so between two superclasses you will be working with 3/5 the amount of data in total (6000 total images here versus 10000 total in the main MP).

You can download the CIFAR-100 data here and extract it to the same place where you've placed the data for the main MP. A custom reader for it is provided here; to use it with the CIFAR-100 data, you should rename this to reader.py and replace the existing file of that name in your working directory.

To set up your code for the extra credit, you must do the following:

• Note that your revised neural net must still have fewer than 500,000 total parameters.
• Define two variables class1 and class2 at the top level of the file containing your neural net (that is, outside of the NeuralNet class) to stand for indices representing superclasses from CIFAR-100. The order of the superclasses listed on the CIFAR description page hints at the index for each superclass; for example, "aquatic mammals" is 0 and "vehicles 2" is 19.
• Rename the file containing the neural net to submit to neuralnet.py.
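For example, the two top-level variables might look like this (these particular superclass picks are arbitrary; choose your own two):

```python
# Top-level superclass indices, defined outside the NeuralNet class.
class1 = 0    # "aquatic mammals", the first superclass listed
class2 = 19   # "vehicles 2", the last superclass listed
```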