Yes, k-means is a good clustering method that is commonly used to analyze microarray expression datasets. However, your question is unclear: are you referring to normalized signal on a per-probeset basis? I don't know what you mean by "standardized". Clustering algorithms can group genes either by absolute expression levels (strong vs. weak) or by expression trends (induced vs. repressed). Whether a normalization step is useful depends on whether you want to focus your analysis on relative or absolute expression levels.
Of course, there are other forms of normalization that apply to the whole array dataset (e.g. RMA), as well as data transformations (most people prefer to work with log2-transformed expression data), so you will need to clarify your question to get more useful feedback here.
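For reference, here is a minimal sketch of the RMA route in R using the affy package (this assumes Affymetrix CEL files sitting in the working directory; note that rma() already returns values on the log2 scale, so no separate transformation is needed):

```r
library(affy)

eset <- rma(ReadAffy())  # ReadAffy() loads all CEL files in the working directory;
                         # rma() background-corrects, normalizes, and summarizes
expr <- exprs(eset)      # probesets-by-samples matrix, already log2 scale
```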
Thanks for your reply, Daniel! I have already normalized and transformed the data before clustering the genes. As the next step, I want to 'center' the genes by subtracting each gene's mean across all samples, and after this I will carry on with k-means clustering. Do you know how to determine the number of clusters that needs to be specified for the k-means algorithm?
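Concretely, the centering step I have in mind looks like this (just a sketch; 'expr' stands for my normalized, log2-scale genes-by-samples matrix):

```r
## subtract each gene's (row's) mean across samples
centered <- sweep(expr, 1, rowMeans(expr))
```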
This is done empirically. More clusters will give you better resolution of related gene groupings, but increase computational time and the complexity of your output. Most people try several different cluster numbers until they observe that no additional interesting biological trends are being highlighted/discovered. For example, for a 4-condition microarray (a 2x2 matrix of variables), I've seen anywhere between 8 and 20 clusters used. I believe there are some statistical approaches to determining an optimal number of clusters for a given dataset based on the correlation scores achieved within the different clusters. However, I think it is fine to play around with different settings and follow your scientific intuition.
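To make the "try several cluster numbers" approach concrete, here is a minimal elbow-plot sketch in R (not a prescribed method, just one common heuristic; 'centered' is assumed to be the mean-centered expression matrix from above):

```r
set.seed(1)
ks  <- 2:20
## total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k)
  kmeans(centered, centers = k, nstart = 25)$tot.withinss)

## look for the "elbow" where adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
```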
"R package" is good package to handle specially microarray data. where you can do kmean clustering and also find out the number of cluster and many more option it has.
There are a couple of different ways of determining the number of clusters that best fits your data. I suggest having a look at the Calinski-Harabasz index as well as the Silhouette index (there are R packages for computing both of these) to get an idea of what the best number of clusters might be.
Thanks for your reply, Paul! I found two packages, fpc and clusterSim. Some people have argued that the two packages give different results. Which one do you prefer?
I would go for fpc. clusterSim uses more than just the distance matrix as input, which strikes me as a bit odd: it asks for the data from which the distance is computed and, I think, assumes your distance is Euclidean. I for one don't use Euclidean distance that much, and with clusterSim I sometimes get negative CH indexes.
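In case it helps, here is a minimal sketch of scanning the CH index over a range of k with fpc (the 'centered' matrix name is an assumption carried over from earlier in the thread):

```r
library(fpc)

set.seed(1)
## kmeansruns() runs k-means for each k in krange and picks the k that
## maximizes the chosen criterion ("ch" = Calinski-Harabasz;
## "asw" would give average silhouette width instead)
runs <- kmeansruns(centered, krange = 2:20, criterion = "ch")

runs$bestk  # suggested number of clusters
runs$crit   # CH index for each k (position in the vector = k)
```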
Firstly, if you want to check out the CH index and how it works (I had a hard time finding the original paper and there's no wiki article), have a look at: http://www.tandfonline.com/doi/abs/10.1080/03610927408827101
(I'm posting the link here because there's a typo in the author's name, which makes the paper hard to find.)
Secondly, and more relevant here, you might want to consider a small variation on the silhouette index if you're going to use it. The fpc implementation gives you the best cluster number based on the mean of the indexes over all points (the silhouette index computes one value per point). If your clusters aren't very clear-cut (and they usually aren't), you might want to use the median instead. For this I recommend the silhouette() function in the cluster package; the results are the same as in fpc, but it's easier to get at the intermediate per-point values.
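A minimal sketch of the median variant, under the same assumptions as before ('centered' is the expression matrix; swap dist() for whatever distance you actually use, bearing in mind that kmeans() itself is inherently Euclidean):

```r
library(cluster)

set.seed(1)
d  <- dist(centered)  # plug in your preferred distance here
ks <- 2:20

med_sil <- sapply(ks, function(k) {
  cl <- kmeans(centered, centers = k, nstart = 25)$cluster
  ## silhouette() returns one value per point; take the median, not the mean
  median(silhouette(cl, d)[, "sil_width"])
})

ks[which.max(med_sil)]  # k with the best median silhouette width
```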