## WS15 Lab 6/23: GMMs and Neural Nets

### I. Grokking the problem

Download the data file, hard2d.txt. This file contains one matrix, D=[ZETA',X'], ZETA=[zeta_1,...,zeta_n], X=[x_1,...,x_n], where zeta_i in (-1,1) is a binary class label, and x_i in [[0,1],[-1,1]] is a 2D real vector drawn randomly from the right half of the signum square.

The first thing you should do, with any dataset, is seek to gain a little intuitive understanding of the data. With a low-dimensional dataset like this one, you can get excellent intuition by drawing a scatter plot. Do that. In matlab, this is plot(X(1,ZETA==1),X(2,ZETA==1),'r.'); hold on; plot(X(1,ZETA==-1),X(2,ZETA==-1),'b.'); title('Scatter plot of the hard2d dataset'); Alternatively you can get a plot by downloading this function: nnplot.m. Download it to your working directory, and type help nnplot to learn how it is used, or just open it up by typing edit nnplot.m.

Print your scatter plot as an image (print -dpng fig1.png).

### II. Neural Networks

In this part of the problem you will create a two-layer neural net with 24 hidden nodes, with tanh activation functions on the outputs of the hidden nodes, and a tanh activation on the output node. You will try several different methods for training the network, in order to see how initial conditions can dramatically change the quality of the resulting classifier.

#### II.A Knowledge-Based Initialization

Write a function [Z,Y]=nnclassify(X,U,V) that takes as input: (1) a matrix of observations, X, whose dimension is p by n, (2) the first-level matrix of weights and biases, U, whose dimension is q by (p+1) (where q is the number of hidden nodes, I recommend q=24 but Amit used q=12, and p=2 is the number of input dimensions), and (2) the second level vector of weights and biases, V, whose dimension is 1 by (q+1). It should compute, as output, the axon values of the hidden and output layers (Y and Z). For example, your function might include lines of code that look like A=U*[ones(1,n);X];, Y=tanh(A), B=V*[ones(1,n);Y];, and Z=tanh(B);

Look at the scatter plot (print it out, if you can). Use ginput to find some points on the boundary between the two classes---the slides have a few pages of sample code that should help you do this. As shown in the slides, you can check your lines by using code something like nnplot(X,ZETA,ZETA,'',1);for m=1:24,plot([0 1],-U(m,1)/U(m,3)+[0,-U(m,2)/U(m,3)]);end --- this will create the scatterplot, then overlay the line corresponding to each row of the U matrix.

Set V=[THRESHOLD,ones(1,q)], where THRESHOLD is a number you choose so that the classifier gives the right answer. If you have 6 lines bounding each region, then probably you'll have THRESHOLD equal to either 5.5 or -5.5, depending on which direction your U matrix came out.

Run nnclassify to generate Z, and plot a scatter plot showing the Z labeling of each token. Make sure it looks the way you expected; if it doesn't, find out why. Compute the error rate of your classifier (ER=sum(ZETA.*Z<0)/n;), and list the error rate in the title of your figure. Print a copy of the figure as an image.

In case you have trouble finishing this part of the lab, you can copy the following U and V matrices. U=[-20.0000, 1.8872, -20.0000; -20.0000, 6.2628,-17.2728; -20.0000, 9.3492, -14.2008; 10.0000, 1.8872, 20.0000; 5.9092, 6.2628,17.2728; 1.3012,9.3492, 14.2008; -10.0000, 1.8872,-20.0000; -11.3636, 6.2628, -17.2728; -12.8996, 9.3492, -14.2008; 0, 1.8872, 20.0000; -2.7272, 6.2628, 17.2728; -5.7992, 9.3492, 14.2008; 0, 1.8872, -20.0000; -2.7272, 6.2628, -17.2728; -5.7992, 9.3492,-14.2008; -10.0000, 1.8872, 20.0000; -11.3636, 6.2628, 17.2728; -12.8996, 9.3492, 14.2008; 10.0000, 1.8872,-20.0000; 5.9092, 6.2628, -17.2728; 1.3012, 9.3492, -14.2008; -20.0000, 1.8872, 20.0000; -20.0000, 6.2628, 17.2728; -20.0000, 9.3492, 14.2008];, and V=[-5.5,-ones(1,24)];

#### II.B Train Using Back-Propagation

Create a function [EPSILON,DELTA]=nnbackprop(X,Y,Z,ZETA,V) calculating the labeling mean-squared error (MSE) back-propagated to the second and first layers of the network, respectively; thus EPSILON is a 1xn vector, and DELTA is a qxn matrix. The function backprop should probably contain lines of code that look something like EPSILON=2*(1-Z.^2).*(Z-ZETA); and DELTA=(1-Y.^2).*(V(:,2:(q+1))'*EPSILON);

Create a function [dU,dV]=nngradient(X,Y,DELTA,EPSILON); that finds the gradient of MSE with respect to U and V. Your function will probably include lines like dU=(1/n)*DELTA*[ones(1,n);X]'; and dV=(1/n)*EPSILON*[ones(1,n);Y].

Create a function [U,V]=nndescent(X,ZETA,U0,V0,ETA,T) that performs T iterations of gradient descent, with a learning rate of ETA, from a starting point of U0,V0.

Start with the knowledge-based neural net of the previous subsection. Perform 1000 iterations of gradient descent, with a learning rate of 0.1. You should wind up with an error rate similar to last time, but slightly better; what is it? Create a scatter plot of the new labeling, and put the error rate in the title.

#### II.C Back-Prop from Random Initial Conditions

Discard your previous weight matrices. Instead, randomly initialize the weight matrices, then perform 1000 iterations of gradient descent with a learning rate of 0.002. What is the error rate now? What is the scatter plot? Try running gradient descent for another 1000 iterations, and another. Do the results get better?

#### II.D Smart Initialization: Simulated Annealing

Use simulated annealing to find the global optimum.

I used Gaussianized annealing with 5000 iterations. Gaussian simulated annealing works like this: compute a step in the direction of the gradient descent, using your back-propagation code from II.C. Compute the training error rate at this initial location. This is your initial error. The initial location is the beginning state of a Markov chain. Now start SA with some initial "hot" temperature. This notion of temperature could simply be the iteration number. Then add a Gaussian random component to the weight vector keeping all the other components of the weight vector fixed. Compute the training error rate at the new location. If the random location is better than the initial location, count this is an acceptance of a Markov chain state transition and go to that state (location). If it's worse, then you can still count this as an acceptance with probability P=exp(-(error_difference)*log(t+1)/ridge), where ridge is the difference between the maximum possible versus the minimum possible error rate that could be achieved on your training dataset, and t is the iteration number. Repeat this process of adding a Gaussian random component to a different component each time until you have swept through all the components of the weight vector (i.e (P+1)*Q + (Q+1)). Sweeping through all the components once constitutes one trial. To establish stationarity, you will need to perform multiple trials at the same temperature. The number of trials required to establish stationarity is heuristic in nature and could be based on some pre-determined values for maximum number of acceptances or rejections or simply the maximum number of trials. Once you have established stationarity, you can decrease the temperature by a small amount. Now, start all over again as you did for the previous temperature - i.e., adding Gaussian random components and then establishing stationarity. It is expected that if you continue doing this for each decreasing temperature, then with high probability you can find the global minimum after establishing stationarity at some minimum temperature. You can set the minimum temperature heuristically. For example, it could be the last iteration number. The final weight vector is the optimal weight vector determined by simulated annealing.

Every time you encounter a weight matrix that is better than its predecessor and successor, you should append that weight matrix to the back of auxiliary rank-3 tensor, cell array, or file, and store its associated cost in an auxiliary cost vector; you should also (automatically) output the very last weight matrix as an option.

Discretized simulated annealing is random in a discrete state space, rather than a continuous state space. The basic idea is to discretize the space of all possible weight vectors (e.g., by quantizing every weight into bins of width 0.1). Compute a gradient update using back-propagation. Then move to a different grid in the weight space, uniformly distributed over all possible grid locations. Compute the training error in the old location and the new location. If the new location has lower error, go there; if not, go there anyway, with probability P=exp(-(error_difference)*log(t+1)/ridge). Every time you find a local optimum (better than the iteration that preceded it, and better than the iteration that follows it), save the weight vector.

Create a scatter plot of the labeling function achieved in this way, and compute its error rate. If you run simulated annealing long enough, it should actually beat the knowledge-based initialization; but that might take a very, very long time.

### III. Gaussian Mixture Models

In this part of the problem you will create a Gaussian mixture model with 4-5 Gaussians/class. You will try several different methods for training the GMM, in order to see how initial conditions can dramatically change the quality of the resulting classifier.

#### III.A Knowledge-Based Initialization

Use edit gmmclassify.m to open a new matlab function file, and define its content with a function declaration on the first line of the form function [Z,G]=gmmclassify(X,C,MU,SIG2). Add some comments (lines starting with %) specifying that this file computes the label corresponding to each observation. The observation matrix, X, is NxP, where N is the number of observations, and P is the dimension of each observation. The matrix of mixture weights, C, is Kx2, where 2 is the number of classes (zeta=+1 vs. zeta=-1), and K is the number of Gaussians. The matrix of means, MU, is KxP. The matrix of variances, SIG2, is also KxP: each row stores the variances corresponding to one Gaussian. Your code might include a line something like G(n,k)=exp(-sum((X(n,:)-MU(k,:)).^2./SIG2(k,:))-0.5*sum(log(SIG2(k,:))));, and another line something like classprob=G*C;, and another line something like Z(classprob(:,2)>classprob(:,1))=-1;.

Estimate the means and variances of four Gaussians in class one, and five Gaussians in class two, such that the boundary between the classes matches the boundary you see in the scatter plot (remember that variance is the square of standard deviation!). Test your hypothesis by classifying the data. Create a scatter plot showing how the GMM labels the data, and fix any large mistakes: this scatter plot should roughly comparable to the correct scatter plot, but might not be exactly the same. Compute the error rate of your classifier (errorrate=sum(ZETA~=Z)/length(Z);), and list the error rate in the title of your figure (using title(sprintf('GMM w/KB Params, Error=%g',errorrate));). Print a copy of the figure as an image, and hand it in with your report.

#### III.B Train Using EM

Create a matlab function [C,MU,SIG2]=gmmtrain(X,ZETA,C0,MU0,SIG20); that accepts labeled data and initial parameter estimates, and performs one iteration of EM (expectation maximization) in order to ``improve'' the parameter estimates.

The first thing your code should do is to compute the posterior probability of the k'th Gaussian given the n'th observation and the zeta'th class, for each possible combination of n, k, and zeta. This might be done by calling gmmclassify to compute the Gaussian probability densities, and then computing the posterior probability as GAMMA(n,k,zeta)=G(n,k)*C0(k,zeta)/(G(n,:)*C0(:,zeta));. You will also need to know the ``expected number of times'' that the k'th Gaussian was chosen: this is NU(k,zeta)=sum(GAMMA(:,k,zeta));. You will also need to know the sum of the gamma-probabilities summed across class labels, e.g., GAMMASUM(n,k)=sum(GAMMA(n,k,:));. This part of the code is called the ``E-step,'' because it calculates the expected number of times that the k'th Gaussian was chosen.

The next thing your code should do is find new values of the mixture weights C, the means MU, and the variances SIG2. This is called the ``M-step,'' because the new parameters are chosen in order to maximize the likelihood (likelihood = prob(data|parameters)). The maximum likelihood mixture weight is just the number of times a component was chosen, divided by the number of times it could have been chosen, thus C(k,zeta)=NU(k,zeta)/sum(NU(:,zeta));. Notice that this formula has a special property that is characteristic of the EM algorithm: if C0(k,zeta)==0 for any particular class zeta and Gaussian k, then it will also be true that NU(k,zeta)==0, and therefore C(k,zeta)==0. This property is extremely useful: it means that you can divide the Gaussians between two classes (for example, setting C0(1:4,2)=0 and C0(5:8,1)=0), and they will stay divided, even after re-training.

The new mean vector is a weighted average of the observation vectors, thus MU(k,:)=(GAMMASUM(:,k)'*X)/sum(GAMMASUM(:,k));. The new sigma vector is computed exactly the same way, but first you need to compute the squared deviation of each observation dimension from the mean of each Gaussian, thus SQDEV(n,:,k)=(X(n,:)-MU(k,:)).^2;. The variance is then the weighted average deviation: SIG2(k,:)=(GAMMASUM(:,k)'*SQDEV(:,:,k))/sum(GAMMASUM(:,k));

It's important to make sure that the variance never gets too close to zero. The EM algorithm won't take care of this automatically for you; you need to do it yourself. For example, after you've computed SIG2, you can lowerbound it with something like SIG2=max(1e-4,SIG2);

Run the gmmtrain function once, using your knowledge-based parameters from II.A as an initial condition. Oops. What happened? Create a scatter plot, showing the labeling function computed by the trained GMM parameters. Take a look at the new mean vectors. Can you tell what happened?

If your code is working correctly, you should find that all of the mean vectors have moved out toward the center of the dataset, and that in doing so, they have really screwed up the classification boundaries. The EM algorithm estimates GMM parameters in order to maximize the likelihood of the observations given the labels. But maximum likelihood is a really terrible training criterion for this dataset. The best classification accuracy is achieved by mean vectors that are out toward the edges of the dataset (as you should have guessed when you initialized the parameters in part III.A), but the maximum likelihood mean vectors are closer to the center of the dataset (in order to better summarize all of the data).

#### III.C MCE Training of a GMM

Devise a training algorithm for the GMM that explicitly minimizes classification error (MCE training). For example, you could estimate the error rate as the average of E=average(y*tanh(a*LLR)), where "y" is the true class label, "a" is a slope hyperparameter, and LLR=log(p(x|class 1)/p(x|class -1)). Can you find the gradient of E with respect to the variances of the Gaussians? Then try writing a gradient descent algorithm that explicitly minimizes the error rate in the training set.

This time, if your code is working, you should find that the error rate on the training set goes down.