Most of the work in the Expectation-Maximization literature assumes the input dataset comes from a Gaussian Mixture Model (GMM), but I could not find any work that verifies a dataset actually follows a GMM distribution before applying the EM algorithm.
One reason why GMMs are used without asking whether the data is GMM distributed is that a GMM is a universal function approximator. That is, whatever the original distribution of the data, when a sufficiently large number of mixture components is allowed, the GMM is expected to approach the true distribution.
Conversely, you can take a normal dataset and add Gaussian Mixture Model (GMM) components to it yourself. In this way you can control what fraction of the data comes from the added mixture, which is useful for testing and analyzing the performance of the algorithm (see the sketch below).
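A minimal sketch of that idea, assuming 1-D data and arbitrary illustration values for the mixture components (the function name and parameters are mine, not from the answer above):

```python
# Build a synthetic dataset where a chosen fraction of points comes from a known
# two-component Gaussian mixture and the rest from a plain standard normal,
# so the "ground truth" is under your control when benchmarking EM.
import numpy as np

rng = np.random.default_rng(0)

def make_partial_gmm(n=1000, mixture_fraction=0.3):
    n_mix = int(n * mixture_fraction)
    plain = rng.normal(0.0, 1.0, size=n - n_mix)
    # Two-component mixture with known means/weights (arbitrary illustration values).
    comp = rng.choice([0, 1], size=n_mix, p=[0.5, 0.5])
    means, stds = np.array([-3.0, 3.0]), np.array([0.5, 1.0])
    mix = rng.normal(means[comp], stds[comp])
    return np.concatenate([plain, mix])

data = make_partial_gmm(mixture_fraction=0.5)
```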
One way to get an idea of whether your data is normally distributed is by experiment (empirically).
Assume that your data is normally distributed and model it using a GMM. Evaluate the performance of your system. If you get good performance, your data is most likely normally distributed; otherwise, it most likely is not (see the sketch below).
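A minimal sketch of that empirical check, assuming placeholder data: fit a single Gaussian and a small GMM on a training split and compare held-out average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X = np.random.default_rng(1).normal(size=(2000, 1))   # replace with your data
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

single = GaussianMixture(n_components=1).fit(X_train)
mixture = GaussianMixture(n_components=3).fit(X_train)

print("held-out log-lik, single Gaussian:", single.score(X_test))
print("held-out log-lik, 3-component GMM:", mixture.score(X_test))
# If the mixture barely improves on the single Gaussian, the data is
# plausibly close to normal; a large gap suggests multimodality.
```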
It is the same problem as modeling a real-life process by assuming it is Markovian when you have no proof that it is. The only indication that your prior assumptions are good is the performance of the resulting system.
I think you are asking whether there is a statistical test to check for a GMM from the dataset without the EM algorithm. I do not know any way to check this without the EM algorithm, but I think we can check the adequacy of the GMM assumption if we can use it. Since the EM algorithm is a maximum likelihood estimator for a GMM, we need it to obtain the parameter estimates. After obtaining them, you can use a likelihood ratio test between a Gaussian distribution with the sample mean and sample variance and a GMM with the parameters estimated by EM. At the very least, you can evaluate the goodness of fit of the GMM compared with the single Gaussian model. If you use the Kolmogorov-Smirnov test between the dataset and the GMM with parameters estimated by EM, then you can evaluate the adequacy of both the model and the parameter estimates at the same time (a rough sketch follows below).
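A rough sketch of both checks for the 1-D case, with placeholder data; note the standard chi-square asymptotics of the likelihood ratio test do not strictly hold for mixtures, so treat it as descriptive (or bootstrap it).

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

x = np.random.default_rng(2).normal(size=1000)        # replace with your data
X = x.reshape(-1, 1)

gauss = GaussianMixture(n_components=1).fit(X)        # plain Gaussian (MLE)
gmm = GaussianMixture(n_components=2).fit(X)          # EM-fitted mixture

# Log-likelihood ratio statistic (score() returns the per-sample average).
llr = 2 * (gmm.score(X) - gauss.score(X)) * len(x)
print("2 * log-likelihood ratio:", llr)

# Kolmogorov-Smirnov test of the data against the fitted mixture's CDF.
def gmm_cdf(t):
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_).ravel()
    return np.sum(w * stats.norm.cdf((t[:, None] - mu) / sd), axis=1)

print(stats.kstest(x, gmm_cdf))
```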
As Marco Huber said, GMMs are flexible approximators of density functions (not of general functions; for those, look at the closely related concept of Gaussian radial basis function networks, which have no normalization constraint as densities do). The question is not whether the density you would like to model with a GMM originally stems from a GMM; it is rather how accurately you want to model it using a general modeling tool like a GMM. The more components you use, the more accurate your approximation will usually be. There are also methods to learn the number of components automatically: see e.g. "The Variational Approximation for Bayesian Inference" by Dimitris G. Tzikas et al. (a sketch of this idea follows below).
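One readily available version of that idea, assuming placeholder data and an arbitrary component cap of 10: a variational Bayesian GMM (scikit-learn's BayesianGaussianMixture) effectively prunes components whose weights shrink toward zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.default_rng(3).normal(size=(1000, 1))   # replace with your data
vb = BayesianGaussianMixture(n_components=10, weight_concentration_prior=1e-2).fit(X)
print("effective components:", np.sum(vb.weights_ > 0.01))
print("weights:", np.round(vb.weights_, 3))
```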
You could also assume different distributions for the experimental data and perform a goodness-of-fit test for each (common component distributions for mixture models are the Gaussian and the Laplacian, or double exponential). You could also use hypothesis testing based on goodness-of-fit tests to check whether the experimental data is close to either of the considered distributions (see the sketch below). Often real data will not fit nicely into either category, so you will also need to consider other effects (e.g. the influence of outliers, the effect of skewness) when selecting a mixture model.
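A small sketch of that comparison, assuming 1-D placeholder data: fit a Gaussian and a Laplace distribution by maximum likelihood and run a KS test against each (with parameters estimated from the same data, the KS p-values are somewhat optimistic, so read them as a rough indication).

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(4).laplace(size=1000)        # replace with your data

for name, dist in [("normal", stats.norm), ("laplace", stats.laplace)]:
    params = dist.fit(x)                                # maximum-likelihood fit
    ks = stats.kstest(x, dist.cdf, args=params)
    print(f"{name}: params={np.round(params, 3)}, KS p-value={ks.pvalue:.3f}")
```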
I agree with Marco Huber. A GMM with no limit on the number of components is completely universal and, in my opinion, cannot produce a unique fit: one can always add a couple of components and obtain a better fit. After all, the delta function is a limit of Gaussians, and every function can be trivially represented as a superposition of delta functions. So, in my opinion, a successful general GMM decomposition tells us nothing about the nature of the phenomenon, just as universal Fourier or Taylor decompositions do not.
I had exactly the same question in mind. I applied a Pearson plot to test it; you may have a look: http://en.wikipedia.org/wiki/Pearson_distribution (a small sketch follows below).
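An illustrative sketch of what goes into that plot, with placeholder data: the Pearson system classifies distributions by squared skewness (beta1) and kurtosis (beta2), so computing these two moments places your sample on the Pearson plane.

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(5).normal(size=1000)          # replace with your data
beta1 = stats.skew(x) ** 2
beta2 = stats.kurtosis(x, fisher=False)                  # non-excess kurtosis
print(f"beta1 (skewness^2) = {beta1:.3f}, beta2 (kurtosis) = {beta2:.3f}")
# A normal distribution sits at beta1 = 0, beta2 = 3 on the Pearson plane.
```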
One good reference for a practical application is:
Y. Delignon et al. (1997). Estimation of generalized mixtures and its application in image segmentation. IEEE Transactions on Image Processing, 6(10), 1364-1375.
To test whether a dataset is normal or not in SPSS, click Analyse, then Descriptive, then Plot, then the normality test. If the p-value is significant, it indicates that the data is not normal. So for normal data one should get a non-significant result, since the null hypothesis is that the data is normal.
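For readers without SPSS, a comparable normality check (my addition, not part of the answer above) is the Shapiro-Wilk test in Python; as above, a significant p-value (e.g. below 0.05) argues against normality.

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(6).normal(size=500)            # replace with your data
stat, p = stats.shapiro(x)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.3f}")
```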
You can plot the histogram of the data and check whether each mode in the histogram is symmetric; if so, that mode can be modeled by a Gaussian. Do this test for the other data as well. If all modes are symmetric, the data can be considered normal (a quick sketch is given below).
If you want to show formally that the data is modeled by a Gaussian, then you have to use a statistical test.
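A quick visual check along these lines, with placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.default_rng(7).normal(size=1000)           # replace with your data
plt.hist(x, bins=50, density=True, alpha=0.7)
plt.xlabel("value")
plt.ylabel("density")
plt.title("Histogram: check each mode for symmetry")
plt.show()
```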
A GMM is used as a parametric model of the probability distribution of continuous measurements; its parameters are estimated from training data using the Expectation-Maximization algorithm.
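A minimal sketch of that workflow with placeholder data: scikit-learn's GaussianMixture runs EM internally and exposes the estimated parameters after fitting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(9).normal(size=(500, 1))       # replace with training data
gmm = GaussianMixture(n_components=2, max_iter=200).fit(X)   # EM under the hood
print("weights:", gmm.weights_)
print("means:", gmm.means_.ravel())
print("variances:", gmm.covariances_.ravel())
```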
Arghad, if you can build a Lorenz curve from your data, check whether it is close to the diagonal of the unit square; if it is very close, then there is a chance (a sketch follows below).
If you also observe that the proportion of the population with values above the mean is close to one half, the chance increases. If it is not close, there is no chance. Thanks, Emilio
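A rough sketch of the Lorenz-curve check, assuming non-negative 1-D placeholder data; proximity of the curve to the diagonal can be summarised by the Gini coefficient, which is 0 exactly on the diagonal (the formula below is a rough discrete approximation).

```python
import numpy as np

x = np.sort(np.abs(np.random.default_rng(8).normal(size=1000)))  # replace with your data
cum = np.cumsum(x) / np.sum(x)                      # Lorenz curve ordinates
pop = np.arange(1, len(x) + 1) / len(x)             # population fraction (abscissa)
gini = 2 * np.mean(pop - cum)                        # twice the area between diagonal and curve
print(f"Gini coefficient: {gini:.3f}  (close to 0 => Lorenz curve near the diagonal)")
```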