Suppose you have a set X of dependent variables and a set Y of dependent ones observed on N individuals. So, I have a vague idea that a causal relationship should be validated(or measured, or discovered).

Let's imagine that X are some socio-economic-demographic variables and Y the amount of money spent on different categories of products.

My objective is to segment people in order to predict their average basket of products. I don't know how to fix in advance the number of clusters, let's say K=2,…,6.

I imagine an algorithm that, given a k-means on X (I look for clusters of objects) the partition of objects discovered using X, a partition is induced on Y. After this step, I want to validate if the partition on X is able to give a good partition of Y (in terms of generalization power, of prediction for an unknown object showing only X, say as you want).

I cannot find k-means like approaches of this type. Naturally, I don't want to take X and Y together since this suggests a correlation instead of causation approach, and I connot predict the cluster of an object having only X observed.

Such a causation approach, if exists, would suggest me a rule for defining an optimal number of clusters?

More Antonio Irpino's questions See All
Similar questions and discussions