What do you mean by binary-valued data? How many features (coefficients) describe these objects? Is it a set of attributes, each of which can take only two values (e.g. true or false)?
You can try:
1) Hierarchical clustering
2) A two-step clustering method (for example: Zhang, T., Ramakrishnan, R., & Livny, M. (1996, June).
BIRCH: An efficient data clustering method for very large databases.
In ACM SIGMOD Record (Vol. 25, No. 2, pp. 103-114). ACM.)
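If it helps, here is a minimal sketch of both suggestions in Python with scikit-learn, assuming a small hypothetical 0/1 feature matrix X (illustrative only); note that both methods work on Euclidean geometry by default, which is relevant to the replies below.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

# Hypothetical toy data: 6 objects described by 4 true/false attributes.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 1]])

# 1) Hierarchical (agglomerative) clustering, Ward linkage on Euclidean distance by default.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# 2) BIRCH (Zhang, Ramakrishnan & Livny, 1996), also Euclidean-based by default.
birch_labels = Birch(n_clusters=2).fit_predict(X)

print("hierarchical:", hier_labels)
print("BIRCH:       ", birch_labels)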
Thanks for the suggestion. Yes, my dataset's attributes take values of true or false only.
I am searching for an algorithm that can produce clusters with an equal number of elements.
I have used hierarchical clustering (both agglomerative and DIANA) with different dissimilarity matrices, but the clusters I am getting are very uneven in size (not balanced); that is my major problem.
When binary values are provided instead of numerical ones, Euclidean distance gives a poor measure of similarity. You could use the Jaccard distance or the cosine distance to measure similarity instead.
You could then use any of the clustering algorithms suggested by Dr. Jopek: the distance measure itself is only a parameter of these methods, so it does not have to be Euclidean (even though that is usually the default).
I would suggest hierarchical clustering with Jaccard distance as the dissimilarity measure; a minimal sketch is given below. The choice between single-link and complete-link is up to you and the nature of the data set.
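For example, here is a minimal sketch with SciPy, assuming a hypothetical 0/1 matrix X of objects-by-attributes: pdist with the 'jaccard' metric builds the dissimilarity matrix and linkage then clusters on it.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical binary data: rows are objects, columns are true/false attributes.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])

# Condensed matrix of pairwise Jaccard distances between the binary rows.
D = pdist(X, metric='jaccard')

# Complete-link hierarchical clustering on the precomputed distances
# (swap in method='single' for single-link, depending on your data).
Z = linkage(D, method='complete')

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)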
Yes, I agree with Lukasz Jopek: Hamming distance and the simple matching coefficient are not a good idea here; you can use the Jaccard index for clustering binary data.
Yes, that's true, and I agree with all of you. But my major problem remains that I want my clusters to be balanced, and with the algorithms discussed above I am getting unbalanced clusters.
1) Using cosine similarity or the Jaccard coefficient is preferable in this case, the prime reason being the inability of measures such as Euclidean distance to capture dissimilarity between binary vectors.
2) Clustering is used to understand distribution properties inherently present in the data. If you are getting results of the same quality after using several different measures, the reason may be that they are all capturing the same underlying distribution. In that case, I would suggest the following remedies:
a) One of the important principles in data mining is to let the data present its own properties and only then draw conclusions about it. Clustering, as a basic data-understanding mechanism, helps reveal those properties. If every clustering method you tried gives the same result, then your data simply has that structure, and you should not force a different structure just to reach a desired conclusion.
b) However, if your ground-truth knowledge tells you that the clusters do not resemble your understanding of the data, I would suggest multi-clustering (http://dme.rwth-aachen.de/sites/default/files/public_files/dmcs-icml2013.pdf) to see the multiple facets of the data that can be observed, or subspace clustering (http://cis.jhu.edu/~rvidal/publications/SPM-Tutorial-Final.pdf) to see the impact of different dimensions on the clustering (in case some dimensions are disrupting the ground-truth structure of your clusters).
3) Try spectral clustering, which generates two clusters, or CLUTO, whose fraction parameter can be used to tune the cluster sizes; a sketch of the spectral approach is given below.
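As a rough illustration of the spectral option, here is a minimal sketch using scikit-learn's SpectralClustering on a precomputed Jaccard similarity matrix (hypothetical toy data X; this only stands in for a proper CLUTO run, whose fraction parameter has no direct equivalent here).

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

# Hypothetical binary data: rows are objects, columns are true/false attributes.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 1]])

# Affinity matrix = Jaccard similarity (1 - Jaccard distance) between the rows.
S = 1.0 - squareform(pdist(X, metric='jaccard'))

# Bipartition the data into two clusters from the precomputed affinities.
labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(S)
print(labels)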