Project 2 - Dimensionality Reduction and Unsupervised Learning (Due 10/17)
Objective:
The objective of this project is, first of all, to practice the use of dimensionality reduction as a preprocessing step and study its impact on classification performance. You need to use both a supervised dimensionality reduction method (i.e., FLD) and unsupervised methods (i.e., PCA and t-SNE) for that purpose. The second objective is to gain an in-depth understanding of unsupervised learning algorithms and how to apply them for classification purposes. The third objective is to extend the horizon of machine learning applications and solve a seemingly quite unrelated problem -- image compression. Think hard about whether image compression should be solved using supervised or unsupervised learning, and what the features are in this application.
Data Sets:
Two datasets will be used. The first is the popular MNIST. The second is a beautiful color image of flowers.
Performance Metrics:
Besides the three metrics used in Project 1, i.e., 1) overall classification accuracy, 2) classwise classification accuracy (or confusion matrix), and 3) run time, we'll introduce a fourth metric that measures the quality of the compressed image as compared to the original full-color image - 4) root mean squared error (RMSE).
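As a reminder of how the fourth metric is computed, here is a minimal sketch of RMSE between an original and a reconstructed image (the function name and array layout are assumptions, not prescribed by the assignment):

```python
import numpy as np

def rmse(original, reconstructed):
    """Root mean squared error between two images stored as (H, W, 3) arrays."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))
```

Casting to float before subtracting matters: subtracting two `uint8` arrays directly would wrap around instead of producing negative differences.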
Tasks:
- Task 1: Visualize the MNIST dataset on a 2-D plane using both supervised and unsupervised dimensionality reduction approaches (FLD, PCA, t-SNE). Comment on what distribution you think would be appropriate to describe these datasets. This should give you a hint as to whether non-parametric learning might be a better option.
- nX (5 pts): the standardized dataset
- fX (10 pts): the projected data from FLD. Note that since MNIST has c=10 classes, the number of dimensions it reduces to should be m = c - 1 = 9. In order to visualize the data, you can further reduce it to 2 dimensions using t-SNE (tfX) or PCA (pfX).
- pX (10 pts): the projected data from PCA. Always report the reconstruction error rate introduced by pX.
- tX (10 pts): the reduced data from t-SNE
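The four reductions above can be sketched with scikit-learn as follows. This is only a sketch under stated assumptions: `X` is the standardized data nX (n × 784), `y` holds the digit labels, and the variable names mirror the assignment's notation; if you are required to implement FLD/PCA yourself, treat this as a reference for checking your own output shapes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_all(X, y):
    # FLD: with c = 10 classes the projection has at most c - 1 = 9 dimensions
    fX = LinearDiscriminantAnalysis(n_components=9).fit_transform(X, y)
    # PCA down to 2 dimensions for plotting
    pX = PCA(n_components=2).fit_transform(X)
    # t-SNE directly on the standardized data
    tX = TSNE(n_components=2).fit_transform(X)
    # tfX: t-SNE applied to the 9-D FLD projection so it can be plotted in 2-D
    tfX = TSNE(n_components=2).fit_transform(fX)
    return fX, pX, tX, tfX
```

Note that t-SNE, unlike FLD and PCA, learns no projection matrix: it must be rerun on each dataset and cannot embed unseen test points.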
- Task 2: Implement kmeans, wta, and hierarchical agglomerative clustering approaches. Use them to solve the image compression problem. Each pixel of this color image has three components: red, green, and blue. Each component is an 8-bit unsigned char; that is, each pixel is represented using 24 bits, for a total of 2^24 possible colors. You are asked to use fewer bits to represent each pixel. For example, if you want to use only 256 colors to represent the original full-color image, then you are essentially using 8 bits to represent each pixel instead of 24. We refer to a color image not showing its full-color potential as a pseudo-color image.
- (25 pts) Draw a table with 4 rows and 3 columns showing the generated pseudo-color images with k = 256, 128, 64, and 32 different colors using kmeans and wta. Underneath each image, display the reconstruction error measured in terms of RMSE.
- (5 pts) Draw the convergence curve for each of the three clustering algorithms. Use k=32.
- (5 pts) Comment on the results both visually and through quantitative measurement (i.e., RMSE).
- Task 3 (25 pts): Apply all four supervised classifiers you developed in Project 1 on nX, fX, pX (use 10% error rate), and tX. For kNN, use the "k" where you obtain the best performance on nX. Also apply the unsupervised methods you developed in Task 2 on these datasets. Think of the best way to organize the results and report them.
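One common way to turn a clustering into a classifier (an assumption on my part, not a method prescribed by the assignment) is to label each cluster with the majority true label among its training members, then predict by cluster membership:

```python
import numpy as np

def clusters_to_labels(cluster_ids, y_true):
    """Map each cluster to the majority class of its members."""
    mapping = {}
    for c in np.unique(cluster_ids):
        members = y_true[cluster_ids == c]
        # majority vote over the true labels falling in this cluster
        vals, counts = np.unique(members, return_counts=True)
        mapping[c] = vals[counts.argmax()]
    return np.array([mapping[c] for c in cluster_ids])
```

The resulting predicted labels can then be scored with the same overall and classwise accuracy metrics used for the supervised classifiers.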
- Final discussion (5 pts).