I really like working on unsupervised learning problems. They offer a completely different challenge from supervised learning problems – there’s much more room to experiment with the data I have. It’s no wonder that so many developments and breakthroughs in the machine learning space are happening in the unsupervised learning domain.
And one of the most popular techniques in unsupervised learning is clustering. It’s a concept we typically learn early on in our machine learning journey and it’s simple enough to grasp. I’m sure you’ve come across or even worked on projects like customer segmentation, market basket analysis, etc.
Let’s take a closer look at how the k-means algorithm works. This will lay the foundation to help you understand where Gaussian Mixture Models come into play later in this article.
So, we first define the number of groups that we want to divide the population into – that’s the value of k. Based on the number of clusters or groups we want, we then randomly initialize k centroids.
The data points are then assigned to the closest centroid, forming k clusters. Each centroid is then updated to the mean of the points assigned to it, and the data points are reassigned. This process repeats iteratively until the centroid locations no longer change.
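Here’s a minimal NumPy sketch of those steps (the function name `kmeans` and the sample-based initialization are illustrative choices, and empty clusters aren’t handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize k centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```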
Gaussian Mixture Models (GMMs) assume that the data comes from a certain number of Gaussian distributions, each of which represents a cluster. Hence, a Gaussian Mixture Model tends to group together the data points that belong to a single distribution.
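To make this concrete, scikit-learn’s GaussianMixture implements this idea; here’s a quick sketch on synthetic blobs (the data and parameter values are just for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three roughly Gaussian blobs
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit a mixture of three Gaussians, one per assumed cluster
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

labels = gmm.predict(X)        # hard assignment: most likely component per point
probs = gmm.predict_proba(X)   # soft assignment: per-component probabilities
print(probs[:5].round(3))
```

Unlike k-means, which makes hard assignments, `predict_proba` gives each point a probability of belonging to each component.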
Different model configurations can be tried, for example varying the number of components or the covariance structure, and you can pick among the candidates using information criteria such as AIC and BIC. Note, however, that the best configuration depends on your dataset, and no universally optimal one may exist.
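As one common way to do this with scikit-learn, you can fit candidate models over a grid of component counts and covariance types and keep the one with the lowest BIC; the grid below is arbitrary, and `.aic(X)` works the same way:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit candidate models over a small grid of component counts and covariance types
candidates = [
    GaussianMixture(n_components=k, covariance_type=cov, random_state=42).fit(X)
    for k in range(1, 7)
    for cov in ("full", "tied", "diag", "spherical")
]

# Keep the model with the lowest BIC (lower is better)
best = min(candidates, key=lambda m: m.bic(X))
print(best.n_components, best.covariance_type, round(best.bic(X), 1))
```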