I ran MD simulations of my protein of interest with some organic molecules such as riboflavin. The molecules form clusters near the protein, and I want to plot a cluster size vs. probability distribution graph for the organic molecules.
You can use hierarchical clustering, following these steps.
Perform Hierarchical Clustering: Apply the Hierarchical Clustering algorithm to your dataset of organic molecules near the protein. This algorithm builds a hierarchical structure of clusters by iteratively merging or splitting clusters based on their similarity.
Determine the Number of Clusters: Use a suitable method, such as inspecting the dendrogram or applying a criterion like the elbow method or the silhouette score, to determine the optimal number of clusters in the hierarchical structure. This will help you define distinct clusters for further analysis.
Assign Cluster Labels: Based on the identified number of clusters, assign cluster labels to each organic molecule in your dataset. Each molecule will be associated with the label of the cluster it belongs to.
Calculate Cluster Sizes: Count the number of organic molecules in each cluster to determine the cluster sizes. This step will provide you with the count or size of each individual cluster.
Probability Distribution: Plot a probability distribution graph to visualize the frequency or probability of different cluster sizes. The x-axis represents the cluster sizes, and the y-axis represents the probability or frequency of occurrence of each cluster size.
Normalize the Distribution: Normalize the probability distribution by dividing the frequency or count of each cluster size by the total number of clusters. This normalization step will give you the relative probability of each cluster size, facilitating comparison and interpretation.
Plot the Graph: Generate a graph that depicts the probability distribution of different cluster sizes. You can use a bar plot, histogram, or line plot to visualize the distribution.
Interpretation: Analyze the probability distribution graph to understand the different cluster sizes and their probabilities. Look for peaks or modes in the distribution that indicate the most common or dominant cluster sizes. Calculate statistical measures such as mean, median, and standard deviation to further characterize the cluster size distribution.
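The steps above could be sketched roughly as follows for a single frame. This is only a minimal sketch under some assumptions, not a definitive implementation: each molecule is reduced to one point (for example its center of mass) in a NumPy array, and the dendrogram is cut at a distance threshold with SciPy's `fcluster` rather than at a fixed number of clusters. The function name `cluster_size_distribution`, the `cutoff` value, and the toy coordinates are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_size_distribution(coords, cutoff):
    """Return (sizes, probabilities) of the cluster sizes in one frame.

    coords : (n_molecules, 3) array, one point per molecule (assumed input)
    cutoff : distance threshold in the same units as coords (assumed value)
    """
    # Build the hierarchy; single linkage joins molecules that are linked
    # by a chain of neighbours.
    Z = linkage(coords, method="single")
    # Assign a cluster label to every molecule by cutting at `cutoff`.
    labels = fcluster(Z, t=cutoff, criterion="distance")
    # Cluster sizes: molecules per cluster (labels start at 1).
    counts = np.bincount(labels)[1:]
    # How many clusters have each size, normalized by the number of clusters.
    sizes, n_clusters = np.unique(counts, return_counts=True)
    probs = n_clusters / n_clusters.sum()
    return sizes, probs


# Toy data (hypothetical): two trimers and two isolated molecules.
coords = np.array([
    [0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 0.3, 0.0],
    [10.0, 0.0, 0.0], [10.3, 0.0, 0.0], [10.0, 0.3, 0.0],
    [20.0, 0.0, 0.0],
    [30.0, 0.0, 0.0],
])
sizes, probs = cluster_size_distribution(coords, cutoff=0.5)
mean_size = float(np.sum(sizes * probs))  # mean cluster size
print(sizes, probs, mean_size)
```

For a whole trajectory you would probably repeat this per frame and pool the cluster counts before normalizing. The resulting `sizes` and `probs` can go straight into a bar plot, e.g. matplotlib's `plt.bar(sizes, probs)`.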
There are other types of clustering, such as k-means, but k-means is considered a traditional method. If you are doing this for a paper, it would be better to use a method other than k-means, or to run k-means alongside hierarchical clustering and Gaussian Mixture Models (GMM) and then compare the results of all three methods.
If you are doing this for homework, just use k-means.
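If you do want to compare the three methods, something like the following might be a starting point. It is a rough sketch assuming scikit-learn is available and that the number of clusters is already chosen; the helper name `compare_methods`, the synthetic blobs, and the use of the silhouette score as the comparison metric are illustrative choices, not part of any fixed recipe.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score


def compare_methods(coords, n_clusters, seed=0):
    """Cluster `coords` with three methods.

    Returns {method name: (cluster sizes, silhouette score)}.
    """
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "hierarchical": AgglomerativeClustering(n_clusters=n_clusters),
        "gmm": GaussianMixture(n_components=n_clusters, n_init=5,
                               random_state=seed),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(coords)
        # Molecules per cluster for this method.
        cluster_sizes = np.bincount(labels, minlength=n_clusters)
        results[name] = (cluster_sizes, silhouette_score(coords, labels))
    return results


# Synthetic, well-separated blobs (hypothetical data, not from a simulation).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [20.0, 0.0, 0.0], [0.0, 20.0, 0.0]])
blob_sizes = [20, 30, 50]
coords = np.vstack([c + rng.normal(scale=1.0, size=(n, 3))
                    for c, n in zip(centers, blob_sizes)])

results = compare_methods(coords, n_clusters=3)
for name, (cluster_sizes, score) in results.items():
    print(name, sorted(cluster_sizes.tolist()), round(score, 3))
```

On real data the three methods will usually disagree somewhat, and that disagreement is the part worth reporting.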