To my knowledge there are at least 11 different methods available (http://www.biomedcentral.com/1471-2105/14/91/). What tests do you prefer and for which kind of data/conditions?
Cufflinks can be used for building transcript models against which obtain counts with HTSeq-count. Our pipeline is: fastq+Tophat+ref genome>.bam files, .bam files+cufflinks> .gtf files (one per library)
.bam +merged.gtf +HTSeq-count> annotated count matrix to input in edgeR or DESeq.
As it was said, the FPKM values from cufflinks can't be used in edgeR and DESeq because these packages analyze count data. In theory the FPKM values could be used with limma... after proper normalization. If needed would treat those values as expression levels from a single color microarray and apply some normalization. But I trully think that the nature of RNAseq calls for count-based models.
I prefer edgeR. The documentation of edgeR is great. The edgeR can be used to analyze replicates data set (highly recommended) and non-replicate. There are several study cases in the User's guide which provide a comprehensive guide to users. Besides, the authors of edgeR are active answering questions in Bionconductor mailing list.
edgeR has now the limma-related functionality, so it is good for complex designs.
DESeq has now the DEXseq extension for splicing analysis.
For comparison of the things "under the bonnet" in those two - check the preprint: http://arxiv.org/abs/1302.3685
You can try also: Cufflinks+Cuffdiff if you trust Cufflinks transcript discovery, and prefere to work on transcript level.
RSEM or BitSeq if you take the strong assumption that you know the complete transcriptome (even in case of human and mouse - anything can be transcribed).
Also the way in what you do the primary analysis (mapping, mapping parameters, counting) - may influence the outcome of differential expression. Yet another thing is the overall library preparation and quality - so the full answer is complex.
I use DEGseq. I tried Cufflinks+CuffDiff but I found the calculation of the number of reads mapping on a specific gene do not match exactly with our results obtained with home made scripts. Probably this was fixed in recent versions of the software, I have tested Cufflinks+CuffDiff last year.
@Reema Singh. Using of cufflinks+Limma/DEGseq could solve the problem but I did not try this strategy. I suggest to verify if the cufflinks step that calculates the number of reads per single transcript is reliable. May be the problem we experienced was related to the identification of multiple transcript isoforms on a large number of genes also in very simple organisms (i.e. yeast). Probably the overestimation of transcript isoforms influenced the number of reads assigned to each single transcript and finally to the genes. Probably this determined the discrepancy between the number of reads per gene identified with our script and those identified by Cufflinks, but this is an hypothesis.
@Stefano & Reema Singh. You can not use cufflink and EdgeR / DESeq since the statistical models behind each program are different. EdgeR / DESeq are only to be used with integer number of counts per gene (or per transcript if you used HTSeq-counts and the option that counts by transcript ID rather than gene ID), and you need not normalize your libraries sizes, since it is done by edgeR / DESeq. Cufflink, CuffDiff and CuffCompare work at the transcript level only and with RPKM counts, assuming that your different libraries are normalized prior to DE analysis.
Personally, I am a great fan of TopHat -> HTSeq -> edgeR. As said above, edgeR is very well documented with nice examples and can sort out analysis with different parameters, like if you treat two strains of animals with a drug.
I vote +1 for HTseq DEseq for people interested at gene expression levels. It is easy to use and give the opportunity to get variance stabilized data suitable for further analysis. For transcripts DE cufflinks-cuffdiff. But as pointed you have many more alternative.
Cufflinks can be used for building transcript models against which obtain counts with HTSeq-count. Our pipeline is: fastq+Tophat+ref genome>.bam files, .bam files+cufflinks> .gtf files (one per library)
.bam +merged.gtf +HTSeq-count> annotated count matrix to input in edgeR or DESeq.
As it was said, the FPKM values from cufflinks can't be used in edgeR and DESeq because these packages analyze count data. In theory the FPKM values could be used with limma... after proper normalization. If needed would treat those values as expression levels from a single color microarray and apply some normalization. But I trully think that the nature of RNAseq calls for count-based models.
We use DEseq and DEGseq and are happy with the correlation to qPCR as long as there are no huge differences in test vs. control samples. As others have pointed out the previous version of the Cufflinks package gave us good results for reconstruction of exon-intron structure but did not perform well for differential gene expression. Maybe this was solved in the recent update.
We use tophat (alignment) samtools (bam file merging) cufflinks (transcript discovery on merged bam files) sigcufflinks (transcript quantification). Then we use DESeq2 R package (Love et al., 2013) combined with HTSFilter (Rau et al., 2012).
Hi all, in the absence of biological replicates which tool you will suggest for DEG analysis. I only have two samples the control and the bacterial infected sample. I already have results from DESeq and CuffDiff. What should be the siginificant logFC, q-value and adjp-value threshold is such case?
Dear Thomas, Thanks for your reply. Yes I also feel I am missing many significant immune genes that in principle, should be overexpress. Which tool you will suggest me to do gene set enrichment analysis? Further, in case if I have to stick with my previous analysis which cut-off of logFC, q-value (CuffDiff) logFC, adjp-value (DESeq) and I should use for my DEG.
okay. the genome of my organism was sequenced in 2010 but the assembly missed so many genes including some antimicrobial peptides so we performend RNAseq experiments first to update previous annotations then as we extracted RNA samples from a control and bacterial challange insect we are also interested to identify any new immune related genes and also the expression changes of well known insect immune genes. Hope it would be bit clear! Thanks.
Dear Thomas, many thanks. If I correctly understand you mean to say about constructing network for GO term overrepresentations for the genes under specific threshold. Unfortunately, I don't think so we can replicate the experiments in current stage.
I use Cufflinks/Cuffdiff for looking at isoform differences in eukaryotes with a well-annotated transcriptome / genome (e.g. human, mouse, fly), and DESeq/DESeq2 for prokaryotes or when there are no good reference genomes available (e.g. Neisseria meningitidis, Schmidtea mediterranea).
3-5 replicates per condition would be ideal (emphasis on 5), but I most often do analysis where people have only had enough research funding to do 2 replicates per condition.
Now I would like to add something to my previous comment: EdgeR is mostly efficient when you have a lot of replicates (>5), DESeq or DESeq2 are more recommended when you have less replicates. I am currently using DESeq2 a lot (which is a bit less stringent then DESeq) and I have wrapped it into a R function so that my colleagues can easily use it. It works pretty well for DE analysis of gene counts from HT-seq, as said above.
For those who are using tophat - cufflinks - Htseqcount - edgeR/deseq2/limma.
Htseq count discards counts mapping to exons that cannot be unambiguosly assigned to the features to be counted. Let's say you use cufflinks to have transcripts analyzed. If you use the merged.gtf from cufflinks, all reads mapping to exons present in different isoforms will be discarded and hence, differential expression of isoforms will rely only on the numbers of the unique exons for the different isoforms. Do I get it right? If so, isn't it, at least, dangerous? Unfortunately Cuffdiff/cuffdiff2 that tried to use a specific model to test DE at the transcript level did not perform well (too many things to model?) and it seems that the authors now advise to simply use limma(without voom) on log2normalized RPKM (y
Have not tried it, but you might use FeatureCounts to get gene length, and then calculate back to counts. And then use the counts into edgeR/DESeq if you like..
Having tried the Tuxedo suite, DEseq2 and edgeR, my personal favourite is TopHat -> HTSeq -> edgeR. The reason is due to edgeR matching our Taqman results the best and edgeR has the absolute best manual and is very easy to use.
It is a pleasure to announce that during the Highlight Track of the ISMB 2014 conference we will give a talk where we will present the key findings of the SEQC/MAQC-III Consortium (http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/#MAQC-IIIalsoknownasSEQC).
The main manuscript of the SEQC Consortium:
"A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequence Quality Control consortium"
is already at the copy-editing stage in the Nature Biotechnology and should be available shortly.
We therefore invite you to take part in the talk (HT-PP27 more details below) and following discussion as well as visit our posters adversing selected key results of the study: F45, F46, F47, F48, and N56
PP27 (HT)
Power and Limitations of RNA-Seq: findings from the SEQC (MAQC-III) consortium
We present an extensive multi-centre multi-platform study of the US-FDA MAQC/SEQC-consortium, introducing a landmark RNA-Seq reference dataset comprising 30 billion reads. Several next-generation-sequencing, microarray, and qPCR platforms were examined. The study design features known mixtures, wide-dynamic range ERCC spikes, and a nested replication structure -- together allowing a large variety of complementary benchmarks and metrics. We find that none of the examined technologies can provide a ‘gold standard,’ making the built-in truths of this reference set a critical device for the development and validation of novel or improved algorithms and data processing pipelines. In contrast to absolute expression-levels, for relative expression measures, good inter-site reproducibility and agreement of across platforms could be achieved with additional filtering steps. Comparisons with microarrays identified complementary strengths, with RNA-Seq at sufficient read-depth detecting differential expression more sensitively, and microarrays achieving higher rank-reproducibility. At the gene level, comparable performance was reached at widely varying read-depths, depending on the application scenario. On the other hand, RNA-Seq has heralded a gold-rush for the study of alternative gene-transcripts. Even at read-depths beyond 100 million, we find thousands of novel junctions, with good agreement between platforms. Remarkably, junctions supported by only ~10 reads achieved qPCR validation-rates >80-100%, illustrating the unique discovery power of RNA-Seq. Finally, the modelling approaches for inferring alternative transcripts expression-levels from read counts along a gene can similarly be applied to probes along a gene in high-density next-generation microarrays. We show that this has advantages in quantitative transcript-resolved expression profiling. There is still much to do!
@Jose Manuel García-Manteiga: this is in response to your point on using htseq-count on cufflinks data. It should not be a problem as long as, we have an identical gene id for the overlapping exons. Also, htseq-count consider a read as ambiguous, if it overlaps to two genes rather than exons. The FAQ in this link (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) addresses this in more detail.
Dear friends. thank you so much for all responses. I have read all of them. but i got confused.I ma free researcher and highly interested in transcriptiomics study. I just set up the tuxedo pathway/cuffdiff in my Ubuntu and I want to use cummerbund for differential analyses. on the other hand, I have no experiences for DEseq and edgeR and limma in R studio for more analysis, however newly I followed a github workshop and could set up commands for that treatment. since I have no experiences in the following packages, would you please let me know if I can use just tuxedo pathway for analyzing differential expression, when I have many samples and conditions? if no, I would appreciate if you could explain completely how i can make input file for DEseq limma and edgeR. if I can make input data table, i will be able to continue the pathway based on what i have learnt from github workshop.
for example, how i can get those read counts for DEseq? can I use tophat or cufflinks for the following packages?
I am just trainee and have no sample data for more experiment but because of my high interest I have to use SRA or GEO data.