Project 4 - MNIST Digit Recognition Using a Neural Network (Due 11/30)
Objective:
The objective of this project is to have a thorough understanding of
the traditional multi-layer perceptron and backpropagation.
Data Sets:
MNIST and XOR
Tasks:
- (20 pts) Task 1: Go through Nielsen's book, Neural Networks and Deep Learning. Ideally, by the end of the semester, you should finish reading at least Chapters 1 through 3. For this project, read through Chapter 1; to answer some of the questions, you will also need to selectively read Chapter 3. Answer the following questions in the report after reading. The answer to each question should take less than a third of a page, assuming single spacing and 12-point Times font.
- (4) Explain the differences between batch processing and online processing
- (4) Explain the differences between gradient descent and stochastic gradient descent. What is mini-batch processing?
- (4) Explain the differences between a perceptron and a sigmoid neuron. What is the advantage of a sigmoid neuron compared to a perceptron? (A short code sketch contrasting step vs. sigmoid activations and full-batch vs. mini-batch updates appears after this question list.)
- (4) Explain the differences between feedforward and backpropagation. What is propagated backward, and what is fed forward?
- (4) Why do we need to introduce "bias" when training a neural network?
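The distinctions asked about in the second and third questions can also be seen directly in code. The following is a minimal sketch, not taken from Nielsen's code: the function names (step, sigmoid, full_batch_gradient, mini_batch_updates) and the logistic-style gradient are illustrative choices of ours, not the assignment's required implementation.

```python
import numpy as np

def step(z):
    # Perceptron: hard threshold. A tiny change in the weights can flip the
    # output from 0 to 1, which makes gradual learning difficult.
    return (z > 0).astype(float)

def sigmoid(z):
    # Sigmoid neuron: smooth output in (0, 1). Small weight changes produce
    # small output changes, which gradient-based learning relies on.
    return 1.0 / (1.0 + np.exp(-z))

def full_batch_gradient(w, X, y):
    # Gradient descent: one update per pass, using the gradient averaged
    # over ALL training samples.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def mini_batch_updates(w, X, y, eta, batch_size, rng):
    # Stochastic / mini-batch gradient descent: shuffle, then update once per
    # small batch -- many cheap, noisy updates per epoch (batch_size=1 is
    # pure online processing; batch_size=len(y) is full-batch processing).
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        w = w - eta * X[b].T @ (sigmoid(X[b] @ w) - y[b]) / len(b)
    return w

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.integers(0, 2, size=8).astype(float)
w = np.zeros(3)
w = w - 0.5 * full_batch_gradient(w, X, y)      # one full-batch step
w = mini_batch_updates(w, X, y, 0.5, 2, rng)    # one epoch of mini-batch steps
```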
- (40 pts) Task 2: Download the code set from GitHub, debug it, and make it run locally.
- (10) Task 2.1: Nielsen was able to use a simple network structure to recognize the handwritten digits in MNIST. Use the same setup he used and plot the convergence curve on MNIST. Note that the convergence curve is a plot of classification accuracy (or error) vs. epoch. Nielsen's code already outputs the accuracy at the end of each epoch; you just need to plot it as Figure 1 (see the plotting sketch at the end of Task 2).
- (30) Task 2.2: Play around with the hyperparameter settings and draw three figures showing the effect of changing each hyperparameter on the convergence curve.
- (10) Effect of network structure (Figure 2). On the same figure, show at least five convergence paths: Nielsen's baseline plus variants with different numbers of hidden layers and different numbers of hidden nodes per layer. Keep the learning rate at 1 and the mini-batch size at 15, but extend the number of epochs to 100 (or whatever number you and your computer have the patience for). Comment on why different performance is observed, if any difference is observed at all. Identify the structure with the best performance.
- (10) Effect of mini-batch size (Figure 3). Use the network structure with the best performance and vary the mini-batch size. Show at least four convergence paths with different mini-batch sizes. Comment on the different performances obtained with the different mini-batch sizes. Identify the size with the best performance.
- (10) Effect of learning rate (Figure 4). Use the network structure and the mini-batch size with the best performance and change the learning rate from 1 to 0.1, 0.01, and 0.001. Show the convergence curves. Comment on the different performances obtained with the different learning rates. Identify the rate with the best performance.
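As a starting point for Figures 1 through 4, the sketch below shows one way to collect and plot the per-epoch accuracy. It assumes the Python 3 port of Nielsen's repository (mnist_loader.py, network.py) is on the path, and that you have modified Network.SGD to return the list of self.evaluate(test_data) values it already computes each epoch (the stock code only prints them). The helper run_config, the structures swept, and the figure filename are our own illustrative choices; substitute the hyperparameter values you actually test.

```python
import matplotlib.pyplot as plt
import mnist_loader   # from Nielsen's repository
import network        # from Nielsen's repository

# Load MNIST; the Python 3 port returns generators, so materialize them once.
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
training_data, test_data = list(training_data), list(test_data)

def run_config(sizes, epochs, mini_batch_size, eta):
    """Train one network and return per-epoch test accuracy.

    Assumes SGD was modified to return the per-epoch evaluate() counts."""
    net = network.Network(sizes)
    correct_per_epoch = net.SGD(training_data, epochs, mini_batch_size, eta,
                                test_data=test_data)
    return [c / len(test_data) for c in correct_per_epoch]

# Figure 2-style sweep: fixed eta and mini-batch size, several structures.
for sizes in [[784, 30, 10], [784, 100, 10], [784, 30, 30, 10]]:
    acc = run_config(sizes, epochs=100, mini_batch_size=15, eta=1.0)
    plt.plot(range(1, len(acc) + 1), acc, label=str(sizes))

plt.xlabel("epoch")
plt.ylabel("test accuracy")
plt.legend()
plt.savefig("figure2_structure.png")
```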
- (40 pts) Task 3: Solving the XOR problem. The four training samples are [[-1, -1], [-1, 1], [1, -1], [1, 1]], with the corresponding labels being [-1, 1, 1, -1]. You saw in HW5 that a single-layer perceptron cannot realize XOR, since the decision boundary is not linear in the 2-D space.
- (15) Task 3.1: Use the network library you developed in Task 2 to implement the XOR gate (see the setup sketch at the end of Task 3). Go through the same practices as in Task 2.2. Fix the mini-batch size at 1. Show the effect of network structure, of different learning rates, and of different initialization strategies (i.e., random initialization vs. initialization with symmetry in the weights; see details in Slide 22 of Lecture 13). You will probably spend more time on XOR than on MNIST. Comment on why it is more difficult to train a smaller network than a larger one.
- (25) Task 3.2: Use an SVM for XOR. Here, we use the kernel trick to solve the problem. The kernel trick uses a kernel function to project the data from the original space (in this case, the 2-D space) to a higher-dimensional space where a linear boundary can be found that perfectly separates the samples (see the helper sketch at the end of Task 3).
- Suppose the kernel function is a 2nd-degree polynomial, i.e., K(x, y) = (x_1 y_1 + x_2 y_2 + C)^2, where x = [x_1, x_2]^T and y = [y_1, y_2]^T. Derive the basis function \phi(x) such that K(x, y) = \phi^T(x) \phi(y).
- Using the derived basis function, what is the higher-dimensional space that the 2-D samples should be mapped to? Provide the higher-dimensional counterparts of the four 2-D samples.
- Apply the perceptron to the higher-dimensional samples and see if you can find a linear decision boundary. Output the learned weights. Project the learned hyperplane back onto the 2-D space and show the decision boundary.
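For Task 3.1, the fragment below is a minimal sketch of how the XOR data can be fed to a Nielsen-style network. It assumes your Task 2 library keeps Nielsen's interface (Network(sizes), SGD(training_data, epochs, mini_batch_size, eta), inputs as column vectors); the choice of a single sigmoid output with targets remapped from {-1, +1} to {0, 1} is ours, not something the assignment mandates. Do not expect every run to converge; that sensitivity to structure, learning rate, and initialization is exactly what Task 3.1 asks you to study.

```python
import numpy as np
import network  # your Task 2 library, assumed to follow Nielsen's interface

# XOR samples and labels; remap targets from {-1, +1} to {0, 1} so they fit
# a single sigmoid output neuron.
samples = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
labels = np.array([-1.0, 1.0, 1.0, -1.0])
training_data = [(x.reshape(2, 1), np.array([[(t + 1) / 2]]))
                 for x, t in zip(samples, labels)]

net = network.Network([2, 2, 1])    # one hidden layer with 2 nodes
net.SGD(training_data, epochs=5000, mini_batch_size=1, eta=0.5)
for x, _ in training_data:
    print(x.ravel(), net.feedforward(x).item())   # output > 0.5 means class +1
```

For Task 3.2, helpers like the following can sanity-check the basis function you derive and then run a plain perceptron on the mapped samples. The mapping phi is deliberately not supplied here (it is what you are asked to derive); check_mapping and train_perceptron are hypothetical helper names chosen for this sketch, and they reuse np, samples, and labels from the block above.

```python
def check_mapping(phi, C=1.0):
    # Numerically verify that K(x, y) = (x1*y1 + x2*y2 + C)^2 equals
    # phi(x) . phi(y) on all pairs of the four XOR samples.
    for x in samples:
        for y in samples:
            k = (x[0] * y[0] + x[1] * y[1] + C) ** 2
            assert np.isclose(k, phi(x) @ phi(y)), (x, y)
    print("phi is consistent with the kernel")

def train_perceptron(Z, targets, eta=1.0, epochs=100):
    # Classic perceptron rule in the mapped space: whenever a sample z with
    # target t in {-1, +1} is misclassified, update w <- w + eta * t * z.
    # If your phi has no constant component, append a 1 to each mapped
    # sample so the hyperplane has a bias term.
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for z, t in zip(Z, targets):
            if t * (w @ z) <= 0:
                w += eta * t * z
                mistakes += 1
        if mistakes == 0:   # linearly separable in the mapped space
            break
    return w

# Usage once you have derived phi (not shown here):
# check_mapping(phi, C=1.0)
# Z = np.array([phi(x) for x in samples])
# w = train_perceptron(Z, labels)   # labels are the original {-1, +1} targets
```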