I have 15 years of panel (longitudinal) data: vegetation indices measured monthly over many years. I want to try both multivariate time-series clustering (I have tried the sktime k-means clustering) and a simplified k-means. My question concerns the variables for the simplified k-means.
I have simplified the data by taking the mean and standard deviation for each of the 12 months (e.g. Jan_mean, Feb_mean, ..., Dec_mean and Jan_sd, Feb_sd, ..., Dec_sd; 24 variables in total). The simplified data resembles a multivariate time series.
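To make the 24-variable layout concrete, here is a minimal sketch of how I build it. It assumes a long-format table with one row per area, year and month, and that the statistics are taken across the years for each calendar month. The column names (`area`, `year`, `month`, `vi`) and the random values are placeholders, not my real data.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Synthetic long-format panel: one row per (area, year, month).
# Column names and values are placeholders for illustration only.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [range(5), range(2001, 2016), range(1, 13)],
    names=["area", "year", "month"],
)
panel = pd.DataFrame({"vi": rng.random(len(idx))}, index=idx).reset_index()

# Mean and standard deviation over the years, per area and calendar month.
stats = panel.groupby(["area", "month"])["vi"].agg(mean="mean", sd="std")

# Move months into columns: one row per area, 12 mean + 12 sd columns.
features = stats.unstack("month")
features.columns = [
    f"{MONTHS[int(month) - 1]}_{stat}" for stat, month in features.columns
]
```

`features` then has one row per area and columns such as `Jan_mean` and `Dec_sd`.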
My aim is to cluster areas that have similar growing conditions, and I am using k-means for this. However, the new data structure has high correlation coefficients between months. This is expected because of phenology: plant growth starts when the temperature reaches >= 6 degrees in spring, peaks in July, and then declines towards the end of the year. I don't want to lose the importance of each month.
However, k-means has difficulty with highly correlated columns, because it effectively gives more weight to them. This can be corrected by substituting Mahalanobis distance for Euclidean distance, but due to the complexity of the calculation I cannot use this solution. I therefore need help with the correlation issue, so that I can reduce the correlation and keep the default distance.
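For reference, as far as I understand it, the Mahalanobis distance computed with a single global covariance matrix is the same as the Euclidean distance after whitening the data (rotating onto the eigenvectors of the covariance matrix and scaling each axis to unit variance). The sketch below checks this on synthetic data with NumPy only. The matrix `X` is a made-up stand-in for my features, and the approach assumes the covariance matrix is invertible, which may not hold for my 24 strongly correlated columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 200 areas x 4 correlated columns.
base = rng.normal(size=(200, 1))
X = base + 0.3 * rng.normal(size=(200, 4))

# Centre the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Whitening: rotate onto the eigenvectors, then scale each axis to unit variance.
Z = (Xc @ eigvecs) / np.sqrt(eigvals)

# For one pair of rows, compare Mahalanobis distance in X
# with plain Euclidean distance in the whitened space Z.
d = X[0] - X[1]
mahalanobis = np.sqrt(d @ np.linalg.inv(cov) @ d)
euclid_whitened = np.linalg.norm(Z[0] - Z[1])
```

If this is right, `Z` could be passed to any ordinary Euclidean k-means, although whitening also inflates low-variance directions that may be mostly noise.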
I have tried aggregating the months into seasons/phenological phases (taking the mean and standard deviation of each). Despite this, the correlation between some seasons is still high, while the correlation between the majority of seasons is moderate (>0.7).
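This is roughly how I aggregate and check the correlations. The month-to-season mapping here uses meteorological seasons purely as an illustration (my phenological grouping may differ), and the column names and random values are again placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic long-format panel; column names are placeholders.
idx = pd.MultiIndex.from_product(
    [range(30), range(2001, 2016), range(1, 13)],
    names=["area", "year", "month"],
)
panel = pd.DataFrame({"vi": rng.random(len(idx))}, index=idx).reset_index()

# Hypothetical month-to-season mapping (meteorological seasons).
season_of = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
panel["season"] = panel["month"].map(season_of)

# Mean and sd per area and season, with seasons moved into columns.
seasonal = (
    panel.groupby(["area", "season"])["vi"]
    .agg(mean="mean", sd="std")
    .unstack("season")
)
seasonal.columns = [f"{season}_{stat}" for stat, season in seasonal.columns]

# Correlation matrix, then the column pairs whose |r| exceeds 0.7.
corr = seasonal.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.7]
```

On the random placeholder data `high` will usually be empty; on my real data it lists the pairs of seasonal columns I am worried about.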
Can I take the month-to-month difference (current minus previous) to significantly reduce the correlation?
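What I mean by the difference is sketched below. The synthetic data is built so that the correlation between months comes from a shared per-area level, which differencing removes by construction, so the comparison only illustrates the idea and says nothing about my real data. The seasonal curve, noise levels and column names are all made up. Note that differencing also discards the absolute level of each month.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

rng = np.random.default_rng(0)
n_areas = 200

# Synthetic monthly means: per-area level + common seasonal curve + noise.
level = rng.normal(0.5, 0.1, size=(n_areas, 1))
curve = 0.3 * np.sin(np.linspace(0, np.pi, 12))  # hypothetical seasonal shape
noise = rng.normal(0, 0.01, size=(n_areas, 12))
monthly = pd.DataFrame(
    level + curve + noise, columns=[f"{m}_mean" for m in MONTHS]
)

# Current month minus previous month; the first column has no predecessor.
diffs = monthly.diff(axis=1).iloc[:, 1:]
diffs.columns = [
    f"{cur}_minus_{prev}" for prev, cur in zip(MONTHS[:-1], MONTHS[1:])
]

def mean_abs_offdiag(df):
    """Average absolute correlation over all distinct column pairs."""
    c = df.corr().to_numpy()
    mask = ~np.eye(len(c), dtype=bool)
    return float(np.abs(c[mask]).mean())

corr_levels = mean_abs_offdiag(monthly)
corr_diffs = mean_abs_offdiag(diffs)
```

Comparing `corr_levels` with `corr_diffs` on the real feature table would show whether differencing actually helps in my case.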
Since the data is in panel form, can I also include within-year variation (annual mean and standard deviation, e.g. Year2001_mean, Year2002_mean, ...)?
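The annual variables I have in mind would be built like this: a minimal sketch, assuming the same long-format layout with placeholder column names (`area`, `year`, `month`, `vi`) and random values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic long-format panel; column names are placeholders.
idx = pd.MultiIndex.from_product(
    [range(5), range(2001, 2016), range(1, 13)],
    names=["area", "year", "month"],
)
panel = pd.DataFrame({"vi": rng.random(len(idx))}, index=idx).reset_index()

# Mean and sd over the 12 months of each year, per area.
annual = (
    panel.groupby(["area", "year"])["vi"]
    .agg(mean="mean", sd="std")
    .unstack("year")
)
annual.columns = [f"Year{int(year)}_{stat}" for stat, year in annual.columns]
```

This adds 30 columns (15 means and 15 standard deviations) per area on top of the monthly variables.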
I would sincerely appreciate your advice.