What is the ideal sample size for an HCA? I read that this clustering method is best suited for small sample sizes, but would it be appropriate for a sample size of n=340 observations?
The sample size for hierarchical cluster analysis depends on the number of clustering variables and the number of clusters. In the simplest case, where clusters are of equal size, Qiu and Joe recommend a sample size of at least ten times the number of clustering variables multiplied by the number of clusters. Dolnicar et al. recommend a sample size of 70 times the number of clustering variables. Overall, researchers should aim for roughly N = 20 to N = 30 observations per expected subgroup.
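To make those rules concrete, here is a quick back-of-the-envelope check in Python. The variable and cluster counts below are hypothetical placeholders, not taken from the question; substitute your own study design:

```python
# Rough sample-size check against the rules of thumb cited above.
n_vars = 5      # number of clustering variables (assumed for illustration)
n_clusters = 4  # number of expected clusters (assumed for illustration)
n = 340         # available sample size

qiu_joe = 10 * n_vars * n_clusters   # Qiu & Joe: >= 10 * variables * clusters
dolnicar = 70 * n_vars               # Dolnicar et al.: >= 70 * variables
per_group = 30 * n_clusters          # ~30 observations per expected subgroup

print(f"Qiu & Joe minimum:  {qiu_joe}")    # 200
print(f"Dolnicar minimum:   {dolnicar}")   # 350
print(f"Per-subgroup (30x): {per_group}")  # 120
print(f"n = {n} meets all three?", n >= max(qiu_joe, dolnicar, per_group))
```

Note that with five clustering variables, n = 340 would satisfy the Qiu & Joe and per-subgroup rules but fall just short of the Dolnicar et al. recommendation, so which rule you apply can matter.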
Hierarchical Cluster Analysis (HCA) is a clustering method that groups similar observations based on their characteristics. The ideal sample size for an HCA depends on several factors, including the complexity of the data, the number of variables, and the desired level of detail in the resulting clusters.
While HCA is often used with smaller sample sizes, it can still be appropriate for larger sample sizes, such as the n=340 observations you mentioned. In fact, HCA can handle datasets of varying sizes. However, there are a few considerations to keep in mind:
Computational Complexity: As the sample size increases, the computational complexity of the clustering algorithm also increases. HCA involves calculating distances between every pair of observations, so time and memory grow roughly quadratically with the number of observations (see the sketch after this list).
Interpretability: With a larger sample size, the resulting dendrogram (the tree-like structure representing the clusters) becomes more complex and harder to read. Identifying distinct clusters and drawing useful insights from the analysis can become challenging.
Variability and Stability: Larger samples can capture more heterogeneity in the data, which can affect the stability of the clustering results. It's important to assess the stability and robustness of the clusters obtained from HCA, particularly with larger sample sizes.
Preprocessing and Feature Selection: With a larger sample size, it becomes even more crucial to carefully preprocess the data and select relevant features. Dimensionality reduction techniques or feature selection methods may be necessary to handle high-dimensional data effectively.
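As a rough illustration of the first two points, here is a minimal SciPy sketch. The data are synthetic stand-ins (340 observations on 5 standardized variables), and the Ward linkage and 4-cluster cut are assumptions for the example, not recommendations for your data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data: 340 observations, 5 variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(340, 5))

# Condensed pairwise distance vector: 340 * 339 / 2 = 57,630 entries,
# trivial to store; memory only becomes a concern at n in the tens of
# thousands, where the full matrix runs into gigabytes.
d = pdist(X, metric="euclidean")

# Ward linkage is a common default for interval-scaled variables.
Z = linkage(d, method="ward")

# Cut the tree into, say, 4 clusters (the number is hypothetical).
labels = fcluster(Z, t=4, criterion="maxclust")

# A full dendrogram with 340 leaves is unreadable; truncating to the
# last 20 merged clusters keeps the structure visible.
dendrogram(Z, truncate_mode="lastp", p=20)
plt.show()
```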
In summary, while HCA is commonly used with smaller sample sizes, it can still be applied to larger datasets like n=340 observations. However, it's essential to consider the computational complexity, interpretability, variability, and preprocessing aspects when using HCA with larger sample sizes.
I don't think you'd have any problem applying any reasonable kind of HCA to 300-odd objects. It will certainly 'work' in the sense of giving you clusters, and it is at least an order of magnitude short of a size that would make a pairwise distance matrix impossible to store. In that sense, 300 _is_ 'small'.
It is probably too big for simple visual interpretation of a complete hierarchical clustering, though, if that's what you were intending - a full dendrogram goes down to individual objects/individuals, and it can get hard to see what's really going on.
Otherwise, for a lot of ordinary purposes, 300 is likely to be _big_ enough to be useful, and that's usually the more important problem if you want to draw any inferences or classify other objects later. Others above have pointed to recommendations; 30 (or more) per expected subgroup is a fair rule of thumb. But it does depend on how many variables you have and what the typical ranges are between and within groups. I'll stick my neck out a little and say there's _no_ simple way of getting an 'ideal' number for clustering in advance; you'd have to know quite a lot about your particular population and your intended use even to make a start on the question.
And an important rider, after all that, is to plan for some cross-validation to make sure your clustering isn't just rolling dice ...
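One simple way to do that check is to re-cluster random subsamples and compare the labelings with the adjusted Rand index (ARI). A sketch, assuming scikit-learn is available; the data, Ward linkage, and cluster count are hypothetical placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 5))  # stand-in data; replace with your own
K = 4                          # hypothetical number of clusters

def cluster_labels(data, k):
    """Ward hierarchical clustering cut into k groups."""
    return fcluster(linkage(data, method="ward"), t=k, criterion="maxclust")

full = cluster_labels(X, K)

# Stability check: re-cluster random 80% subsamples and compare the
# resulting labels with the full-sample labels on the shared rows.
scores = []
for _ in range(50):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = cluster_labels(X[idx], K)
    scores.append(adjusted_rand_score(full[idx], sub))

# ARI near 1 means the partition barely moves under resampling;
# values near 0 suggest the 'clusters' may be little more than noise.
print(f"mean ARI over 50 subsamples: {np.mean(scores):.2f}")
```

On pure noise like this synthetic example you should see a low mean ARI, which is exactly the 'rolling dice' outcome the rider above warns about.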