What is your favorite DEG test for RNA-seq data?

04 April 2013

To my knowledge there are at least 11 different methods available (http://www.biomedcentral.com/1471-2105/14/91/). What tests do you prefer and for which kind of data/conditions?

Juan P Steibel

Just a clarification:

Cufflinks can be used to build transcript models, against which counts are then obtained with HTSeq-count. Our pipeline is:

fastq + TopHat + reference genome > .bam files

.bam files + Cufflinks > .gtf files (one per library)

.gtf files + reference transcriptome + cuffmerge > merged.gtf file (merged annotation)

.bam files + merged.gtf + HTSeq-count > annotated count matrix to input into edgeR or DESeq.

As was said, the FPKM values from Cufflinks can't be used in edgeR and DESeq because these packages analyze count data. In theory the FPKM values could be used with limma after proper normalization. If needed, I would treat those values as expression levels from a single-color microarray and apply some normalization. But I truly think that the nature of RNA-seq calls for count-based models.
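
To illustrate why FPKM values are not counts, here is a minimal Python sketch of the usual FPKM formula (fragments per kilobase of transcript per million mapped fragments). The function name, gene lengths and counts are made up for illustration, and Cufflinks itself uses a more elaborate model (effective lengths, isoform deconvolution), so treat this as a sketch only.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    per_million = counts.sum() / 1e6  # library size in millions of fragments
    return counts / lengths_kb / per_million

# Hypothetical library with three genes (made-up numbers).
counts = [500, 1500, 8000]
lengths_bp = [1000, 3000, 2000]
values = fpkm(counts, lengths_bp)
print(values)
```

In this toy example the first two genes end up with the same FPKM although one is supported by 500 fragments and the other by 1500. That difference in supporting evidence is exactly what count-based models rely on, and it is no longer visible after the conversion.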

Kean-Jin Lim

I prefer edgeR. Its documentation is great. edgeR can be used to analyze data sets with replicates (highly recommended) and without. The User's Guide contains several case studies that provide comprehensive guidance. In addition, the authors of edgeR actively answer questions on the Bioconductor mailing list.

Christoph Grunau

DESeq if you have replicates. If not DEGSeq.

Andrew D Millard

DESeq; my data are from a bacterial system.

Keunsoo Kang

I have been using TopHat + HTSeq + DESeq and TopHat + Cufflinks. I prefer the DESeq pipeline.

Michal Okoniewski

Very good question. No perfect answer :)

edgeR now has limma-related functionality, so it is good for complex designs.

DESeq now has the DEXSeq extension for splicing analysis.

For comparison of the things "under the bonnet" in those two - check the preprint: http://arxiv.org/abs/1302.3685

You can also try Cufflinks+Cuffdiff if you trust Cufflinks transcript discovery and prefer to work at the transcript level.

RSEM or BitSeq if you accept the strong assumption that you know the complete transcriptome (even in human and mouse, anything can be transcribed).

Also, the way you do the primary analysis (mapping, mapping parameters, counting) may influence the outcome of differential expression. Yet another factor is the overall library preparation and quality, so the full answer is complex.

Reema Singh

Use Cufflinks (http://cufflinks.cbcb.umd.edu/). Results can be visualized with the R package cummeRbund.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html#materials

Stefano Campanaro

I use DEGseq. I tried Cufflinks+CuffDiff, but I found that its calculation of the number of reads mapping to a specific gene did not exactly match the results obtained with our home-made scripts. This was probably fixed in recent versions of the software; I tested Cufflinks+CuffDiff last year.

Reema Singh

@Stefano. What if someone uses Cufflinks + limma/DEGseq?

Stefano Campanaro

@Reema Singh. Using Cufflinks + limma/DEGseq could solve the problem, but I did not try this strategy. I suggest verifying whether the Cufflinks step that calculates the number of reads per transcript is reliable. Maybe the problem we experienced was related to the identification of multiple transcript isoforms for a large number of genes, even in very simple organisms (e.g. yeast). The overestimation of transcript isoforms probably influenced the number of reads assigned to each transcript and, in turn, to the genes. This probably caused the discrepancy between the number of reads per gene identified with our script and those identified by Cufflinks, but this is a hypothesis.

Fabrice Chatonnet

@Stefano & Reema Singh. You cannot use Cufflinks output with edgeR / DESeq, since the statistical models behind each program are different. edgeR / DESeq are only to be used with integer counts per gene (or per transcript, if you used HTSeq-count with the option that counts by transcript ID rather than gene ID), and you need not normalize your library sizes beforehand, since edgeR / DESeq do this themselves. Cufflinks, CuffDiff and CuffCompare work at the transcript level only and with RPKM values, assuming that your different libraries are normalized prior to DE analysis.
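
As a rough illustration of the library-size normalization these packages do internally, here is a Python sketch of the median-of-ratios size factors used by DESeq (edgeR's default is TMM, which is not shown). The function name and the tiny count matrix are hypothetical; the real packages handle many more edge cases.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Only genes with non-zero counts in every sample have a usable geometric mean.
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    # Per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor of a sample = median ratio of its counts to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Made-up raw integer counts: sample 2 was sequenced about twice as deeply.
raw = np.array([[10, 20], [100, 200], [30, 60], [0, 5]])
sf = size_factors(raw)
normalized = raw / sf
print(sf)
```

The point is that the input stays as raw integer counts, and the scaling is estimated from those counts rather than applied by the user beforehand.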

Personally, I am a great fan of TopHat -> HTSeq -> edgeR. As said above, edgeR is very well documented with nice examples and can handle analyses with several factors, such as when you treat two strains of animals with a drug.

Paweł P Łabaj

As already mentioned, if you work with a well-annotated transcriptome, then BitSeq.

Olivier Armant

I vote +1 for HTSeq + DESeq for people interested in gene-level expression. It is easy to use and gives you the opportunity to get variance-stabilized data suitable for further analysis. For transcript-level DE, Cufflinks + Cuffdiff. But as pointed out, there are many more alternatives.

Reema Singh

@Fabrice. Thank you so much for this explanation. I really didn't know about this. But is it fine to use limma for RNA-seq DE analysis after Cufflinks?

Steffen Priebe

I found a nice paper regarding this topic:

"Comprehensive evaluation of differential expression analysis methods for RNA-seq data" by Rapaport et al. 2013.

One result is that negative binomial (NB) methods such as DESeq and edgeR perform quite well.
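
For readers wondering what "negative binomial" buys you here: under the NB model the variance of a count with mean mu is mu + dispersion * mu^2, so it grows faster than the Poisson variance (which equals mu). A small Python sketch with made-up values for the mean and dispersion; the function name and parameter values are assumptions for illustration only.

```python
import numpy as np

def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: Poisson part plus overdispersion."""
    return mean + dispersion * mean ** 2

rng = np.random.default_rng(0)
mean, dispersion = 100.0, 0.1  # hypothetical gene: mean count 100, dispersion 0.1

# NumPy parameterizes the NB by (n, p); convert from (mean, dispersion).
n = 1.0 / dispersion
p = n / (n + mean)
sample = rng.negative_binomial(n, p, size=200_000)

print("theoretical variance:", nb_variance(mean, dispersion))
print("simulated variance:  ", sample.var())
print("Poisson variance:    ", mean)
```

With these numbers the NB variance is roughly eleven times the Poisson variance, which is the kind of extra spread between biological replicates that DESeq and edgeR try to estimate.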

Christoph Grunau

We use DESeq and DEGseq and are happy with the correlation to qPCR, as long as there are no huge differences between test and control samples. As others have pointed out, the previous version of the Cufflinks package gave us good results for reconstructing exon-intron structure but did not perform well for differential gene expression. Maybe this was solved in the recent update.

Sandra Regina Maruyama

I am also a great fan of TopHat -> HTSeq -> edgeR for differential expression analysis. edgeR is very well documented and gives you robust results.

Carine Genet

We use TopHat (alignment), samtools (bam file merging), Cufflinks (transcript discovery on merged bam files) and sigcufflinks (transcript quantification). Then we use the DESeq2 R package (Love et al., 2013) combined with HTSFilter (Rau et al., 2012).

Shishir K Gupta

Hi all, in the absence of biological replicates, which tool would you suggest for DEG analysis? I only have two samples: the control and the bacteria-infected sample. I already have results from DESeq and CuffDiff. What logFC, q-value and adjusted p-value thresholds should count as significant in such a case?
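
For context on what an adjusted p-value is: DESeq's adjusted p-values use the Benjamini-Hochberg procedure by default, which can be sketched in a few lines of Python. The function name and the p-values below are made up, and this sketch says nothing about which cut-off is appropriate without replicates.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]  # hypothetical raw p-values
padj = bh_adjust(pvals)
print(padj)
```

Note that the adjustment only controls the false discovery rate relative to the raw p-values it is given; if those rest on a dispersion estimate made without replicates, the adjusted values inherit that weakness.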

Shishir K Gupta

Dear Thomas, thanks for your reply. Yes, I also feel I am missing many significant immune genes that, in principle, should be overexpressed. Which tool would you suggest for gene set enrichment analysis? Further, if I have to stick with my previous analysis, which cut-offs for logFC and q-value (CuffDiff) and for logFC and adjusted p-value (DESeq) should I use for my DEGs?

Shishir K Gupta

Dear Thomas, we will first do qRT-PCR experiments for some of the genes and will also use their expression in future insect immunity models.

Shishir K Gupta

Okay. The genome of my organism was sequenced in 2010, but the assembly missed many genes, including some antimicrobial peptides, so we performed RNA-seq experiments first to update the previous annotations. As we extracted RNA samples from a control and a bacteria-challenged insect, we are also interested in identifying any new immune-related genes as well as expression changes in well-known insect immune genes. I hope that makes it a bit clearer! Thanks.

Shishir K Gupta

Dear Thomas, many thanks. If I understand correctly, you mean constructing a network of GO term over-representation for the genes under a specific threshold. Unfortunately, I don't think we can replicate the experiments at the current stage.

Shishir K Gupta

Dear Thomas, many thanks for your expert comments. Do you also have a reference to a paper where such an approach has been implemented before?

David Andrew Eccles

I use Cufflinks/Cuffdiff for looking at isoform differences in eukaryotes with a well-annotated transcriptome / genome (e.g. human, mouse, fly), and DESeq/DESeq2 for prokaryotes or when there are no good reference genomes available (e.g. Neisseria meningitidis, Schmidtea mediterranea).

3-5 replicates per condition would be ideal (emphasis on 5), but I most often do analysis where people have only had enough research funding to do 2 replicates per condition.

Reema Singh

Limma

Fabrice Chatonnet

Now I would like to add something to my previous comment: edgeR is most efficient when you have a lot of replicates (>5), while DESeq or DESeq2 are recommended when you have fewer replicates. I am currently using DESeq2 a lot (it is a bit less stringent than DESeq) and I have wrapped it into an R function so that my colleagues can easily use it. It works pretty well for DE analysis of gene counts from HTSeq, as said above.

Jose Manuel García-Manteiga

For those who are using TopHat - Cufflinks - htseq-count - edgeR/DESeq2/limma:

htseq-count discards reads mapping to exons that cannot be unambiguously assigned to the features being counted. Let's say you use Cufflinks to analyze transcripts. If you use the merged.gtf from Cufflinks, all reads mapping to exons present in different isoforms will be discarded, and hence differential expression of isoforms will rely only on the counts from the exons unique to each isoform. Do I get it right? If so, isn't it, at the very least, dangerous? Unfortunately Cuffdiff/Cuffdiff2, which tried to use a specific model to test DE at the transcript level, did not perform well (too many things to model?) and it seems that the authors now advise simply using limma (without voom) on log2-normalized RPKM (y

Sindre Lee

I have not tried it, but you might use featureCounts to get gene lengths and then back-calculate to counts, and then use those counts in edgeR/DESeq if you like.

Having tried the Tuxedo suite, DESeq2 and edgeR, my personal favourite is TopHat -> HTSeq -> edgeR. The reason is that edgeR matched our TaqMan results best, has the absolute best manual and is very easy to use.
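
The back-calculation suggested here would amount to inverting the FPKM formula. A minimal Python sketch, assuming you know the gene length and the total number of mapped fragments that was used for normalization; the function name and all numbers are hypothetical.

```python
def fpkm_to_counts(fpkm, length_bp, total_mapped_fragments):
    """Approximate fragment count implied by an FPKM value."""
    return fpkm * (length_bp / 1e3) * (total_mapped_fragments / 1e6)

# Made-up example: a 3 kb gene in a tiny library of 10,000 mapped fragments.
approx = fpkm_to_counts(fpkm=50000.0, length_bp=3000, total_mapped_fragments=10_000)
print(round(approx))
```

The result is only an approximation: Cufflinks estimates abundances with its own length and normalization model, so the recovered numbers are generally not integers and need not match what a direct counting tool would report.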

Paweł P Łabaj

Dear All,

It is a pleasure to announce that during the Highlight Track of the ISMB 2014 conference we will give a talk where we will present the key findings of the SEQC/MAQC-III Consortium (http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/#MAQC-IIIalsoknownasSEQC).

The main manuscript of the SEQC Consortium:

"A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium"

is already at the copy-editing stage at Nature Biotechnology and should be available shortly.

We therefore invite you to attend the talk (HT-PP27, more details below) and the following discussion, as well as to visit our posters advertising selected key results of the study: F45, F46, F47, F48 and N56.

PP27 (HT)

Power and Limitations of RNA-Seq: findings from the SEQC (MAQC-III) consortium

Date: Monday, July 14, 11:00 am - 11:25 am

Room: 304

https://www.iscb.org/cms_addon/conferences/ismb2014/paperpresentations.php

Abstract:

We present an extensive multi-centre multi-platform study of the US-FDA MAQC/SEQC-consortium, introducing a landmark RNA-Seq reference dataset comprising 30 billion reads. Several next-generation-sequencing, microarray, and qPCR platforms were examined. The study design features known mixtures, wide-dynamic range ERCC spikes, and a nested replication structure -- together allowing a large variety of complementary benchmarks and metrics. We find that none of the examined technologies can provide a ‘gold standard,’ making the built-in truths of this reference set a critical device for the development and validation of novel or improved algorithms and data processing pipelines. In contrast to absolute expression-levels, for relative expression measures, good inter-site reproducibility and agreement across platforms could be achieved with additional filtering steps. Comparisons with microarrays identified complementary strengths, with RNA-Seq at sufficient read-depth detecting differential expression more sensitively, and microarrays achieving higher rank-reproducibility. At the gene level, comparable performance was reached at widely varying read-depths, depending on the application scenario. On the other hand, RNA-Seq has heralded a gold-rush for the study of alternative gene-transcripts. Even at read-depths beyond 100 million, we find thousands of novel junctions, with good agreement between platforms. Remarkably, junctions supported by only ~10 reads achieved qPCR validation-rates >80-100%, illustrating the unique discovery power of RNA-Seq. Finally, the modelling approaches for inferring alternative transcripts expression-levels from read counts along a gene can similarly be applied to probes along a gene in high-density next-generation microarrays. We show that this has advantages in quantitative transcript-resolved expression profiling. There is still much to do!

Paweł P Łabaj

The SEQC consortium papers are now available on-line:

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2957.html

Detecting and correcting systematic variation in large-scale RNA sequencing data

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3000.html

The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3001.html

plus a complementary one by the ABRF consortium:

Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2972.html

Sachin Pundhir

@Jose Manuel García-Manteiga: this is in response to your point about using htseq-count on Cufflinks data. It should not be a problem as long as the overlapping exons have an identical gene ID. Also, htseq-count considers a read ambiguous if it overlaps two genes, not two exons of the same gene. The FAQ at this link (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) addresses this in more detail.
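
The gene-level rule described here can be sketched in Python as follows. This is a simplified illustration in the spirit of htseq-count's union mode, not its actual implementation: the function name, coordinates and gene IDs are invented, and strand, chromosomes and spliced reads are ignored.

```python
from collections import Counter

def assign_read(read_start, read_end, exons):
    """Assign a read to a gene; exons are (start, end, gene_id), half-open."""
    # Collect the distinct gene IDs of all exons the read overlaps.
    genes = {gene for start, end, gene in exons
             if start < read_end and read_start < end}
    if not genes:
        return "__no_feature"
    if len(genes) == 1:
        return genes.pop()  # several exons of one gene still count once
    return "__ambiguous"    # overlaps exons of two or more different genes

exons = [
    (100, 200, "geneA"),  # same exon listed for two isoforms of geneA
    (100, 200, "geneA"),
    (150, 300, "geneB"),
    (500, 600, "geneC"),
]
reads = [(110, 140), (160, 190), (520, 560), (700, 750)]
tally = Counter(assign_read(s, e, exons) for s, e in reads)
print(tally)
```

In this toy example the read falling in the exon shared by two isoforms of geneA is still counted for geneA, because the decision is made on gene IDs. Only the read overlapping both geneA and geneB is set aside as ambiguous.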

Martin Lewinski

My current favorite:

STAR>Salmon>tximport>edgeR

Hamid Fiuji

Dear friends, thank you so much for all the responses. I have read all of them, but I got confused. I am an independent researcher and highly interested in transcriptomics. I have just set up the Tuxedo pipeline (Cuffdiff) on my Ubuntu machine and I want to use cummeRbund for differential analyses. On the other hand, I have no experience with DESeq, edgeR and limma in RStudio, although I recently followed a GitHub workshop and managed to set up the commands for that workflow. Since I have no experience with these packages, could you please let me know whether I can use the Tuxedo pipeline alone for differential expression analysis when I have many samples and conditions? If not, I would appreciate it if you could explain in full how I can make the input file for DESeq, limma and edgeR. If I can make the input data table, I will be able to continue the workflow based on what I learnt from the GitHub workshop.

For example, how can I get the read counts for DESeq? Can I use TopHat or Cufflinks for these packages?

I am just a trainee and have no sample data of my own for further experiments, but because of my strong interest I have to use SRA or GEO data.

Thank you very much in advance for your help.
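
On the input table: following the TopHat -> htseq-count route described earlier in the thread, you get one two-column, tab-separated file per sample (feature ID, count), and the count-based packages expect these merged into a single genes x samples matrix of integers. A minimal Python sketch of that merge, assuming summary rows are prefixed with "__" (as in recent htseq-count versions); the function names, sample names and file contents are hypothetical, and in practice you would pass open file handles instead of the in-memory strings used here.

```python
import io

def read_htseq(handle):
    """Parse one htseq-count output into {feature_id: count}."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):  # skip summary rows such as __no_feature
            continue
        counts[feature] = int(value)
    return counts

def build_matrix(samples):
    """samples: dict of sample name -> readable handle. Returns header and rows."""
    per_sample = {name: read_htseq(handle) for name, handle in samples.items()}
    genes = sorted(set().union(*per_sample.values()))
    header = ["gene_id"] + list(per_sample)
    rows = [[g] + [per_sample[s].get(g, 0) for s in per_sample] for g in genes]
    return header, rows

# Made-up contents standing in for two htseq-count output files.
control = io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t5\n")
treated = io.StringIO("geneA\t25\ngeneB\t3\n__no_feature\t7\n")
header, rows = build_matrix({"control": control, "treated": treated})

print("\t".join(header))
for row in rows:
    print("\t".join(str(x) for x in row))
```

The printed table (one row per gene, one column per sample, raw integer counts) is the kind of matrix that can then be read into R for DESeq or edgeR.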
