Imagine that a researcher gives you a data set to analyse consisting of the measured heights of 100 adults. The problem is, the researcher failed to distinguish between males and females when taking measurements. When you plot the distribution of heights you see two clear peaks - most likely corresponding to a peak in male heights and a peak in female heights. Unfortunately, it is difficult to see from this messy data what the average height of men or the average height of women is, and it is even more difficult to say in retrospect whether a particular observation was a man or a woman.
After firing your researcher, you may want to turn to Gaussian mixture models (GMMs) for a practical solution. A GMM assumes that the observed data is made up of a mixture of several Gaussian distributions. These individual distributions (referred to as mixture components) may be given different means and variances. They may also be given different mixture weights. The final distribution is obtained by multiplying each mixture component by its associated mixture weight and then adding them together (the mixture weights must sum to one). In this example we would want two Gaussian distributions - one for men and one for women - with different means and perhaps different variances, and where the mixture weights correspond to the probability of a random individual being male or female (roughly 0.5 in each case).
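To make that concrete, here is a minimal sketch in Python (numpy/scipy), using made-up means, standard deviations and weights rather than anything estimated from real data:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for the height example (cm); illustrative only.
means = np.array([165.0, 178.0])   # female and male component means
sds = np.array([7.0, 8.0])         # component standard deviations
weights = np.array([0.5, 0.5])     # mixture weights, must sum to one

def mixture_pdf(x):
    """Weighted sum of the two Gaussian component densities at x."""
    return sum(w * norm.pdf(x, mu, sd) for w, mu, sd in zip(weights, means, sds))

# Evaluate the mixture density on a grid of heights.
grid = np.linspace(140, 200, 5)
print(mixture_pdf(grid))
```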
Creating a mixture like the one above is very easy. Inferring the values of the unknown means, variances and mixture weights from an observed data set is slightly more tricky. Two commonly used methods are maximum-likelihood estimation and Bayesian posterior inference.
Maximum likelihood usually works via the Expectation-maximisation (EM) algorithm. In each step of the algorithm the unknown parameters are updated to their conditional maximum likelihood values, given everything else. After a few steps the values converge to a local optimum. We have no guarantee that this local optimum is globally the best solution, but by running the algorithm from many dispersed starting positions we can at least gain some confidence in our result. Dempster, Laird and Rubin (1977) "Maximum Likelihood from Incomplete Data via the EM Algorithm" is the standard citation in this area, although a quick Google search for "EM Gaussian Mixture" returns many more gentle introductions.
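For illustration, here is a bare-bones sketch of those E and M updates for a one-dimensional two-component GMM in Python (numpy only); the starting values are arbitrary, and in practice you would run it from several dispersed starts as mentioned above:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture; x is a 1-D numpy array."""
    # Arbitrary starting values.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        dens = np.stack([w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates given the responsibilities.
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

# Example on simulated "heights" from two groups.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(165, 7, 50), rng.normal(178, 8, 50)])
print(em_gmm(x))
```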
Bayesian inference requires us to specify priors on our unknown parameters, and then proceeds via Markov chain Monte Carlo (MCMC). The final algorithm is similar to the EM algorithm, except that in each step the unknown values are drawn at random from the appropriate conditional distribution, rather than being set at their maximum likelihood values. If conjugate priors are used then very efficient algorithms can be designed using Gibbs sampling, or for more general priors a Metropolis-Hastings step can be used. The expressions required as part of this algorithm are in fact very simple - far simpler than you might think from browsing the literature. If you are interested in pursuing this angle let me know and I'll write them down in a reply as simply as I can.
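To give a flavour of the Gibbs route, below is a sketch for the same one-dimensional two-component mixture, assuming independent Normal priors on the means, inverse-gamma priors on the variances and a Dirichlet prior on the weights; the hyperparameter choices are illustrative placeholders, not recommendations:

```python
import numpy as np

def gibbs_gmm(x, n_iter=2000, seed=0):
    """Gibbs sampler for a two-component 1-D Gaussian mixture.

    Assumed priors (hyperparameters are placeholders):
      mu_k ~ Normal(mu0, tau0^2), sigma_k^2 ~ InvGamma(a0, b0), w ~ Dirichlet(1, 1).
    """
    rng = np.random.default_rng(seed)
    n, K = len(x), 2
    mu0, tau0_sq, a0, b0 = x.mean(), x.var() * 10, 2.0, x.var()
    # Initial values.
    mu = rng.choice(x, K).astype(float)
    sig2 = np.full(K, x.var())
    w = np.full(K, 1.0 / K)
    draws = {"mu": [], "sig2": [], "w": [], "z": []}
    for _ in range(n_iter):
        # 1. Draw the latent group label z_i of each observation
        #    (constants cancel in the normalisation below).
        like = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / sig2) / np.sqrt(sig2)
        prob = like / like.sum(axis=1, keepdims=True)
        z = (rng.random(n)[:, None] > np.cumsum(prob, axis=1)).sum(axis=1)
        # 2. Draw the mixture weights from their Dirichlet conditional.
        counts = np.bincount(z, minlength=K)
        w = rng.dirichlet(1.0 + counts)
        # 3. Draw each component mean and variance from their conditionals.
        for k in range(K):
            xk = x[z == k]
            nk = len(xk)
            v = 1.0 / (1.0 / tau0_sq + nk / sig2[k])
            m = v * (mu0 / tau0_sq + xk.sum() / sig2[k])
            mu[k] = rng.normal(m, np.sqrt(v))
            b = b0 + 0.5 * np.sum((xk - mu[k]) ** 2)
            sig2[k] = 1.0 / rng.gamma(a0 + nk / 2.0, 1.0 / b)
        for key, val in zip(("mu", "sig2", "w", "z"), (mu, sig2, w, z)):
            draws[key].append(val.copy())
    return {k: np.array(v) for k, v in draws.items()}
```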
The EM approach has the advantage of being very quick - the algorithm often converges within 10 or so iterations. It is most appropriate when the different clusters in the data are very obvious, so there is little variance around the maximum likelihood estimate. The Bayesian approach takes longer, as many samples are needed in order to produce an accurate reconstruction of the posterior distribution. The advantage of this approach, however, is that the uncertainty in the final values is captured completely. For example, following an MCMC one could write down the posterior probability that any given observation was male or female based on the observed height alone. As with all Bayesian procedures, the influence of the prior can be made small relative to the data whenever the sample size is reasonably large, or alternatively, strong informative priors can be used to integrate alternative sources of information when they are available.
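For example, continuing from the Gibbs sketch above, the posterior probability that observation i came from a given component is just the fraction of retained draws in which its label z_i equals that component (ignoring, for the moment, the label-switching issue mentioned below):

```python
import numpy as np

# Simulated "heights" (illustrative only), fed to the gibbs_gmm sketch above.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(165, 7, 50), rng.normal(178, 8, 50)])
out = gibbs_gmm(x)  # the sketch function defined above

# Posterior probability that each observation belongs to component 1:
# the fraction of post-burn-in draws in which its label equals 1.
burn = 500
prob_component1 = (out["z"][burn:] == 1).mean(axis=0)
```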
There are some nuances to mixture models that can be tricky to overcome - for example the "label-switching problem", or the issue of estimating the number of mixture components - but the basic framework of GMMs is straightforward. The added flexibility that comes from using mixtures, compared with simple parametric forms, is worth the effort in many applications.
The GMM is very closely related to the k-means clustering algorithm, where you cluster the data into k sets. Intuitively, one tries to maximize the distance between different clusters while minimizing the distance of points within each cluster. The EM algorithm is a way to fit the GMM (and k-means can be viewed as a special, "hard-assignment" case of it).
The (high-level) difference between k-means and GMM is that in k-means each point has a "hard" assignment to a single cluster, whereas in a GMM there are "soft" assignments, i.e., each data point could have been generated by any of several mixture components (with different probabilities).
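If scikit-learn is available, the hard-versus-soft distinction is easy to see side by side; a quick sketch on simulated heights (default settings, illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Simulated 1-D "heights", reshaped to a column since sklearn expects 2-D input.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(165, 7, 50), rng.normal(178, 8, 50)]).reshape(-1, 1)

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)           # one label per point
soft = GaussianMixture(n_components=2, random_state=0).fit(x).predict_proba(x)  # a probability per component

print(hard[:5])           # hard assignments, e.g. 0 or 1 for each point
print(soft[:5].round(2))  # soft responsibilities; each row sums to one
```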
Although it was not my question, I learned from the answer. But I still have one problem with the resulting GMM: if I choose to have 3 clusters, then after running the GMM iterations I will get 3 mixture components. The problem is how to determine the label for each cluster?
The basic idea is to reorder the labels to minimise a loss function between one iteration and the next. This creates a single thread that passes through all iterations, leading to a single labelling throughout.
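One simple way to implement that relabelling, sketched below, is to permute each iteration's components so that their means match the previous iteration's as closely as possible (here using the Hungarian algorithm from scipy as the loss-minimisation step; other loss functions work too):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(mu_draws):
    """Permute component labels at each MCMC iteration to match the previous one.

    mu_draws: array of shape (n_iter, K) of component means; returns the
    permutation applied at each iteration (apply the same permutation to the
    other parameter draws as well).
    """
    n_iter, K = mu_draws.shape
    perms = np.zeros((n_iter, K), dtype=int)
    perms[0] = np.arange(K)
    ref = mu_draws[0]
    for t in range(1, n_iter):
        # Cost of matching component j (this iteration) to slot i (reference).
        cost = (ref[:, None] - mu_draws[t][None, :]) ** 2
        _, col = linear_sum_assignment(cost)
        perms[t] = col
        ref = mu_draws[t][col]   # the relabelled means become the new reference
    return perms
```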
Your answers were awesome. I would like to extend this question with a missing data element.
I'm unclear on how the E-step in EM deals with missing data.
Say I have a dataset as provided in the attachment.
Let's say I have the basic code for a GMM as below:
http://tinyheero.github.io/2016/01/03/gmm-em.html
Then, using the code / basic equations in the E-step and M-step, how can I replace the missing values in the course grade vector (as in the attachment) via the iterative EM process?
Your help is highly appreciated - a million thanks.
The missing data problem is an interesting one. First, it is important to point out that the unknown group that each observation belongs to (i.e. male vs. female in the example I gave above) is sometimes referred to as missing or "incomplete" data. This is what the 'parameter estimation in the "incomplete data" scenario' section is talking about in the link you posted (which is an excellent resource btw). But this is a special sort of missing data in the sense that we "invented" the idea of observations belonging to groups and then called that data missing! Mathematically this is often referred to as a latent variable, which basically means a variable that we don't know the value of.
This should be kept conceptually distinct from actual missing values in your data. For example, in the table you attach it looks like you are exploring the ability of different methods to impute missing values in your observed data. Here you are dealing with "real" missing data, in addition to any latent variables in your model.
So if you wanted to include a GMM in your comparison, here is what I would do. 1) Construct a model that is a mixture of two linear regressions, i.e. each regression has its own slope, intercept and perhaps error variance. I would also create mixture weights as free parameters, with the first regression line having weight p and the second weight q (where p+q=1). 2) Construct a latent variable called "group", with each observation belonging to either group1 or group2. The group dictates the regression line that the observation is fitted to. Draw groups from the prior initially. 3) At each stage in your MCMC, update the group assignment of each observation by drawing from its conditional posterior, then update the slope, intercept and variance parameters conditional on the known groupings. Hopefully you will be able to find appropriate priors and the corresponding conditional posterior distributions for these steps by Googling around and doing some math. The end result of this process will be a series of draws of the three parameters of each regression line, plus the mixture weights.
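To give a concrete (if rough) idea, here is a sketch of what such a sampler might look like in Python, assuming flat priors on the regression coefficients, a 1/sigma^2 prior on each error variance and a Dirichlet(1,1) prior on the weights; these prior choices, the initialisation and the guard against near-empty groups are all placeholder assumptions you would want to tailor to your data:

```python
import numpy as np

def gibbs_mix_regression(x, y, n_iter=2000, seed=0):
    """Sketch of a Gibbs sampler for a mixture of two linear regressions.

    Assumes flat priors on intercepts/slopes, a 1/sigma^2 prior on each error
    variance and a Dirichlet(1, 1) prior on the mixture weights (illustrative
    choices only).
    """
    rng = np.random.default_rng(seed)
    n, K = len(y), 2
    X = np.column_stack([np.ones(n), x])                          # intercept + slope design matrix
    beta = np.tile(np.linalg.lstsq(X, y, rcond=None)[0], (K, 1))  # start both lines at the overall fit
    beta[1] += rng.normal(0, 0.1, 2)                              # jitter so the two lines differ
    sig2 = np.full(K, y.var())
    w = np.full(K, 1.0 / K)
    draws = []
    for _ in range(n_iter):
        # 1. Draw the group label of each observation from its conditional posterior.
        resid = y[:, None] - X @ beta.T                           # (n, K) residuals under each line
        like = w * np.exp(-0.5 * resid ** 2 / sig2) / np.sqrt(sig2)
        prob = like / like.sum(axis=1, keepdims=True)
        z = (rng.random(n)[:, None] > np.cumsum(prob, axis=1)).sum(axis=1)
        # 2. Draw the mixture weights.
        counts = np.bincount(z, minlength=K)
        w = rng.dirichlet(1.0 + counts)
        # 3. Draw each line's coefficients and error variance given its group.
        for k in range(K):
            Xk, yk = X[z == k], y[z == k]
            if len(yk) < 3:                                       # keep the sketch from collapsing
                continue
            XtX_inv = np.linalg.inv(Xk.T @ Xk)
            bhat = XtX_inv @ Xk.T @ yk
            beta[k] = rng.multivariate_normal(bhat, sig2[k] * XtX_inv)
            rss = np.sum((yk - Xk @ beta[k]) ** 2)
            sig2[k] = 1.0 / rng.gamma(len(yk) / 2.0, 2.0 / rss)
        draws.append((beta.copy(), sig2.copy(), w.copy(), z.copy()))
    return draws
```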
If you want to use the same model to impute values, here is what I'd do. 1) Use all of the steps above on just the data that you have access to, i.e. the observations for which you have course grades. 2) Given a complete set of posterior draws under this model, you can then impute an entire distribution for each unknown observation. You would do this by drawing from the posterior predictive distribution, given the known math aptitude (again, Google this term if it is new to you). 3) Look at these posterior predictive distributions for each imputed variable. If they are complex and multi-modal then report the whole distribution, or, if you think they can be summarised by some statistic such as the mean, then report that value as the final imputed value.
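As a sketch of step 2), assuming the `gibbs_mix_regression` output above, the posterior predictive distribution of a missing grade at math score x_new can be approximated by pushing each retained draw through the model: pick a group using that draw's weights, then draw a grade from the corresponding regression line plus noise:

```python
import numpy as np

def impute_grade(x_new, draws, burn=500, seed=1):
    """Posterior predictive draws of the missing grade at covariate x_new,
    using the (beta, sig2, w, z) tuples returned by the sketch above."""
    rng = np.random.default_rng(seed)
    preds = []
    for beta, sig2, w, _ in draws[burn:]:
        k = rng.choice(len(w), p=w)                        # pick a mixture component
        mean = beta[k, 0] + beta[k, 1] * x_new             # that line's prediction
        preds.append(rng.normal(mean, np.sqrt(sig2[k])))   # add regression noise
    return np.array(preds)  # report the whole distribution, or e.g. its mean
```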
The explanation above assumes MCMC rather than EM, and assumes a mixture of just two components, but each of these can be relaxed. I should also emphasise that there is no guarantee that this model will fit better than a simple regression (in fact in your case I'd be surprised if it did). But, if there is some strong effect hidden in the data (for example results are from two different schools that have different standards of examination) then it may fit better.