Yes, k-means is a good clustering method that is commonly used to analyze microarray expression datasets. However, your question is unclear: are you referring to normalized signal on a per-probeset basis? I don't know what you mean by "standardized". Clustering algorithms can group genes either by absolute expression levels (strong vs. weak) or by expression trends (induced vs. repressed). Whether a normalization step is useful depends on whether you want to focus your analysis on relative or absolute expression levels.
Of course, there are other forms of normalization that apply to the whole array dataset (e.g. RMA), as well as data transformations (most people prefer to work with log2-transformed expression data), so you will need to clarify your question to get more useful feedback here.
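For reference, here is a minimal sketch of the RMA route in R using the affy package (this assumes Affymetrix CEL files sitting in the working directory; note that rma() already returns values on the log2 scale, so no separate transformation is needed):

```r
library(affy)

eset <- rma(ReadAffy())  # ReadAffy() loads all CEL files in the working directory;
                         # rma() background-corrects, normalizes, and summarizes
expr <- exprs(eset)      # probesets-by-samples matrix, already log2 scale
```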
Thanks for your reply, Daniel! I have already normalized and transformed the data before clustering the genes. As the next step, I want to 'center' the genes by subtracting each gene's mean across all samples, and after this I will carry on with k-means clustering. Do you know how to determine the number of clusters that needs to be specified for the k-means algorithm?
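Concretely, the centering step I have in mind looks like this (just a sketch; 'expr' stands for my normalized, log2-scale genes-by-samples matrix):

```r
## subtract each gene's (row's) mean across samples
centered <- sweep(expr, 1, rowMeans(expr))
```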
This is done empirically. More clusters will give you better resolution of related gene groupings, but increase computational time and the complexity of your output. Most people try several different cluster numbers until they observe that no additional interesting biological trends are being highlighted/discovered. For example, for a 4-condition microarray (a 2x2 matrix of variables), I've seen anywhere between 8 and 20 clusters used. I believe there are some statistical approaches to determining an optimal number of clusters for a given dataset based on the correlation scores achieved within the different clusters. However, I think it is fine to play around with different settings and follow your scientific intuition.
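To make the "try several cluster numbers" approach concrete, here is a minimal elbow-plot sketch in R (not a prescribed method, just one common heuristic; 'centered' is assumed to be the mean-centered expression matrix from above):

```r
set.seed(1)
ks  <- 2:20
## total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k)
  kmeans(centered, centers = k, nstart = 25)$tot.withinss)

## look for the "elbow" where adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
```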
"R package" is good package to handle specially microarray data. where you can do kmean clustering and also find out the number of cluster and many more option it has.
There are a couple of different ways of determining the number of clusters that best fits your data. I suggest having a look at the Calinski-Harabasz index as well as the Silhouette index (there are R packages for computing both of these) to get an idea of what the best number of clusters might be.
Thanks for your reply, Paul! I found two packages, fpc and clusterSim. Some people have argued that the two packages give different results. Which one do you prefer?
I would go for fpc. clusterSim uses more than just the distance matrix as input, which strikes me as a bit odd: it asks for the data from which the distance is computed and, I think, assumes your distance is Euclidean. I for one don't use Euclidean distance that much, and with clusterSim I sometimes get negative CH indexes.
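In case it helps, here is a minimal sketch of scanning the CH index over a range of k with fpc (the 'centered' matrix name is an assumption carried over from earlier in the thread):

```r
library(fpc)

set.seed(1)
## kmeansruns() runs k-means for each k in krange and picks the k that
## maximizes the chosen criterion ("ch" = Calinski-Harabasz;
## "asw" would give average silhouette width instead)
runs <- kmeansruns(centered, krange = 2:20, criterion = "ch")

runs$bestk  # suggested number of clusters
runs$crit   # CH index for each k (position in the vector = k)
```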
Firstly, if you want to check out the CH index and how it works (I had a hard time finding the original paper and there's no wiki article), have a look at: http://www.tandfonline.com/doi/abs/10.1080/03610927408827101
(I'm posting the link here because there's a typo in the author's name, which makes the paper hard to find.)
Secondly, and more relevant here, you might want to consider a small variation on the silhouette index if you're going to use it. The fpc implementation gives you the best cluster number based on the mean of the indexes over all points (the silhouette index computes one value per point). If your clusters aren't very clear-cut (and they usually aren't), you might want to use the median instead. For this I recommend the silhouette() function in the cluster package; the results are the same as in fpc, but it's easier to get at the intermediate per-point values.
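A minimal sketch of the median variant, under the same assumptions as before ('centered' is the expression matrix; swap dist() for whatever distance you actually use, bearing in mind that kmeans() itself is inherently Euclidean):

```r
library(cluster)

set.seed(1)
d  <- dist(centered)  # plug in your preferred distance here
ks <- 2:20

med_sil <- sapply(ks, function(k) {
  cl <- kmeans(centered, centers = k, nstart = 25)$cluster
  ## silhouette() returns one value per point; take the median, not the mean
  median(silhouette(cl, d)[, "sil_width"])
})

ks[which.max(med_sil)]  # k with the best median silhouette width
```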