10 November 2021

Let us assume that 4 Gaussian mixture components have been fitted on a dataset. Each of the four components has its own mean and covariance matrix. Now, a number of unknown samples are given, each of which is an array of shape (n, 2). I want to test these samples against the four components and allocate each sample to a specific component based on a particular value.

To solve this problem, I did some digging around and some of the things I have noticed are as follows:

1. sklearn.mixture.GaussianMixture has methods like predict(X) and score(X[, y]), which can be used to predict the labels or compute the per-sample average log-likelihood of the given data X. However, both treat each row of X as a separate data point, which is not what I need. I want a score function or a predict function for the entire array X rather than for each of its rows.

2. I also found out that KL divergence can be used to test the similarity between two distributions. The lower the KLD value, the better. So, I computed it between each sample X and each of the GMM components. Strangely, after doing that, the minimum for each sample X turns out to be at the same component. So if there are 10 samples, the minimum KLD is with GMM-4 for each of the 10 samples, which would mean that all 10 samples are most similar to GMM-4.
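To make the goal in point 1 concrete, here is a minimal sketch of one way a whole (n, 2) array could be scored against each component: sum the row-wise log-densities under each component's Gaussian and pick the largest total. It assumes the rows of a sample are independent draws and that the model was fitted with covariance_type="full". The helper name assign_sample_to_component and the toy data are hypothetical, and means_init is set only to keep the toy example reproducible.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def assign_sample_to_component(gmm, X):
    """Score a whole (n, 2) array X against every component of a fitted GMM.

    Returns the index of the component with the highest total log-likelihood
    and the array of totals (one entry per component).
    Assumes the rows of X are independent and covariance_type="full".
    """
    totals = np.array([
        multivariate_normal(mean=m, cov=c).logpdf(X).sum()
        for m, c in zip(gmm.means_, gmm.covariances_)
    ])
    return int(np.argmax(totals)), totals


# Toy data (hypothetical): four well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
train = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])

gmm = GaussianMixture(
    n_components=4,
    covariance_type="full",
    means_init=centers,  # only to make the toy example reproducible
    random_state=0,
).fit(train)

# One "unknown sample": an (n, 2) array drawn near the third center.
X_new = rng.normal(centers[2], 1.0, size=(50, 2))
best, totals = assign_sample_to_component(gmm, X_new)
print("assigned component:", best)
print("total log-likelihood per component:", totals)
```

If the mixing weights should also count, np.log(gmm.weights_) could be added to the totals before taking the argmax; whether that is appropriate depends on the application.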

I have attached the equation for the closed form KLD expression I have used.
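For reference, here is a minimal sketch of the KLD comparison, assuming the attached expression is the standard closed-form KL divergence between two multivariate Gaussians and that each sample X is summarised by its own mean and covariance. The function names and toy parameters are hypothetical. Note that KL divergence is not symmetric, so the order of the arguments matters.

```python
import numpy as np


def kl_gaussian(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ) for k-dimensional Gaussians."""
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    cov0, cov1 = np.asarray(cov0, dtype=float), np.asarray(cov1, dtype=float)
    k = mu0.shape[0]
    diff = mu1 - mu0
    cov1_inv = np.linalg.inv(cov1)
    # slogdet is used instead of log(det(...)) for numerical stability.
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (
        np.trace(cov1_inv @ cov0)
        + diff @ cov1_inv @ diff
        - k
        + logdet1
        - logdet0
    )


def kl_sample_to_components(X, means, covs):
    """KL from a Gaussian fitted to the (n, 2) array X to each component."""
    mu_x = X.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    return np.array([kl_gaussian(mu_x, cov_x, m, c) for m, c in zip(means, covs)])


# Toy components (hypothetical): four unit-covariance Gaussians.
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
covs = np.stack([np.eye(2)] * 4)

rng = np.random.default_rng(1)
X = rng.normal(means[1], 1.0, size=(200, 2))  # sample drawn near the second component

kls = kl_sample_to_components(X, means, covs)
print("KL to each component:", kls)
print("closest component:", int(np.argmin(kls)))
```

With a fitted sklearn model, gmm.means_ and gmm.covariances_ (for covariance_type="full") could be passed as means and covs.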

Any idea regarding how this problem can be approached will be appreciated!
