How to perform differential gene expression analysis for a DESeq2 dataset with multiple factor levels?

10 October 2019

I'm a novice to RNA-seq analysis and DGE, hoping to get some good ideas moving forward.

For context: I have 15 samples, 3 replicates each of 5 sample types. They're mRNA samples obtained from bacterial cultures enriched on minimal media with different carbon sources for each sample type, using a starting inoculum of field soil. We've aligned the reads back to a metagenome taken from the same soil and now I have a dataset consisting of about 74000 transcripts. I've used DESeq2 to normalize the dataset and have been looking at the patterns within using R.

Question 1: how (if at all) should I filter the initial dataset to remove low-abundance, low-information transcripts? One method I've tried is to remove transcripts that have more than 1 (or more than 3) zeros across all 15 samples, but I'd like to know if there's a better approach, as I think this will help with my confidence down the line.

Question 2: how should I conduct differential gene expression analysis given my specific design? I want to determine which transcripts are significantly enriched (or depleted) in each of my 5 sample types, particularly whether some transcripts (or KEGG categories / subcategories, as I have that information as well) are specifically enriched in one or two sample types relative to the others. One method I've tried is to compare the transcript expression values for each sample type to the mean expression (i.e. making a new column of mean values in the DESeq object and calling results() on it using the contrast of my sample type over the mean). However, even after DESeq2 normalization, there appears to be some variation in sampling depth between my sample types that is biasing what I'm finding. If I perform a variance-stabilizing transformation on the dataset and look at expression patterns on a heatmap, I can clearly see clusters of transcripts that are more associated with one or two sample types than with the others, but I'd like to identify them more precisely and with statistical significance.

Thanks so much for any help you can give me! If there's any more clarification I can provide, I'm happy to.

Antonio Federico

Hi Dan!

(A small caveat: I have never worked with bacterial sequencing or metagenomic data, so my answers are of a general nature.)

Question 1: I would not say that your approach is wrong, but there are more reliable methods around. For instance, you could filter based on CPM (counts per million), a proportion test, or a Wilcoxon test. There is an easy-to-use implementation of these methods in the NOISeq package (the filtered.data function).
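As a rough illustration of the CPM idea, here is a minimal base-R sketch on simulated counts. It is not the NOISeq implementation; the cutoff of 1 CPM, the "at least as many samples as the smallest group" rule, and all object names (`counts`, `sample_type`, `cpm_cutoff`) are assumptions chosen for the example.

```r
# Simulated counts: 1000 transcripts x 15 samples (5 sample types x 3 replicates).
set.seed(1)
sample_type <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
base_mu <- 2^runif(1000, -3, 8)  # per-transcript mean, from very low to high
counts <- matrix(rnbinom(1000 * 15, mu = base_mu, size = 0.5), nrow = 1000,
                 dimnames = list(paste0("tx", 1:1000), paste0(sample_type, 1:3)))

# Counts per million: divide each column by its library size, then scale.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Keep a transcript if its CPM exceeds the cutoff in at least as many samples
# as the smallest group (3 here), so that a transcript expressed in only one
# sample type can still pass the filter.
cpm_cutoff <- 1                          # assumed threshold; tune to your data
min_samples <- min(table(sample_type))   # 3 replicates per sample type
keep <- rowSums(cpm > cpm_cutoff) >= min_samples
filtered <- counts[keep, , drop = FALSE]

cat(sum(keep), "of", nrow(counts), "transcripts kept\n")
```

Unlike counting zeros across all 15 samples, a rule tied to the group size does not penalize transcripts that are absent from four sample types but clearly present in the fifth.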

Question 2: I am not sure I fully understood your problem, but since you talk about "enrichment" (enrichment = differential expression?), I guess you want to fish out the genes that are deregulated in one or two conditions with respect to the others. If this is the case, the only way I see is to run pairwise comparisons among all of your conditions in DESeq2 and then compare the results. But to perform pairwise comparisons you need to define which condition is the baseline (aka control). I'm not sure that comparing each condition with the mean expression is the best way to go, as it could flatten the differences among the conditions. Sorry if I misunderstood something. I hope this helped at least a bit.
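A minimal sketch of the all-pairs approach in DESeq2, on simulated data. The column name `sample_type`, the level names, the spiked-in transcripts, and the 0.05 cutoff are illustrative assumptions. Note that the `contrast` argument of results() can name any two levels of the factor, so each pair simply uses its second level as the reference for that comparison.

```r
library(DESeq2)

# Simulated counts: 500 transcripts x 15 samples (5 sample types x 3 replicates).
set.seed(1)
sample_type <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
mu <- matrix(2^runif(500, 2, 10), nrow = 500, ncol = 15)
mu[1:50, sample_type == "E"] <- mu[1:50, sample_type == "E"] * 8  # spiked in type E
counts <- matrix(rnbinom(500 * 15, mu = mu, size = 2), nrow = 500,
                 dimnames = list(paste0("tx", 1:500), paste0(sample_type, 1:3)))

coldata <- data.frame(sample_type = sample_type, row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ sample_type)
dds <- DESeq(dds, quiet = TRUE)

# All 10 unordered pairs of the 5 sample types.
pairs <- combn(levels(sample_type), 2, simplify = FALSE)

# One Wald-test results table per pair: log2 fold change of p[1] over p[2].
res_list <- lapply(pairs, function(p) {
  results(dds, contrast = c("sample_type", p[1], p[2]))
})
names(res_list) <- sapply(pairs, paste, collapse = "_vs_")

# Number of transcripts with adjusted p-value below 0.05 in each comparison.
n_sig <- sapply(res_list, function(r) sum(r$padj < 0.05, na.rm = TRUE))
print(n_sig)
```

A transcript specific to one sample type would then show up as significant, with the same sign, in all four comparisons involving that type.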

Good luck!

Umar Niazi

You can find an example workflow we use at the git repository: https://github.com/uhkniazi/BRC_SupernumeraryTooth_Gui_PID_19

For your first question, see: https://github.com/uhkniazi/BRC_SupernumeraryTooth_Gui_PID_19/blob/5d1fb0f56b0bc107c5c297d8fc0f2a57563a2e31/09_exploratoryAnalysis.R#L59

Do perform some exploratory analysis of the count matrix. More details can be found here: https://laplacebayes.wordpress.com/2017/06/02/compare-transformations-batch-effects-in-omics-data/

As for your question 2, Antonio Federico has already answered it. We tend to use Stan for our modelling, and we use DESeq as a quicker first-pass analysis: https://github.com/uhkniazi/BRC_SupernumeraryTooth_Gui_PID_19/blob/5d1fb0f56b0bc107c5c297d8fc0f2a57563a2e31/10_deAnalysis.R#L239

If your design becomes too complex (e.g. nested factors), DESeq may not even work for your analysis because of singularity issues. That is why we tend to use Stan for our modelling tasks.

Justin G. A. Whitehill

We had a similar challenge with an experiment in which we had 2 genotypes × 3 treatments × 4 biological replicates. We worked closely with a biostatistics professor who now works at R. We had planned to publish the methods in a separate paper but unfortunately never got around to it. All of the relevant scripts should be on GitHub. The paper's DOI is 10.1111/nph.15477. I also have a copy on my ResearchGate profile. I hope this helps with your problem.

Fabrice Chatonnet

Hello Dan,

For your first question, I would recommend the HTSFilter package, which uses statistical inference on your data set to give a reliable cutoff for lowly expressed genes. It even provides a method to directly get the filtered normalized read counts, and it's easy to use.

For your second question, I would go either (or both) for a comparison of every sample type against each of the others (10 comparisons), or use the heatmaps you generated to perform a DEG analysis with a new experiment table containing a new factor column, in which you group the sample types that you deem close as "new sample type 1" and all the others as "new sample type 2". Repeat that step with as many comparisons / combinations of sample types as you consider legitimate (based on your heatmaps). One easy way to find out whether a gene is DE in at least one comparison is to first run a global analysis on your DESeq dataset, before looking at individual comparisons, using the likelihood ratio test, which is a kind of ANOVA for DESeq2: dds = DESeq(dds, test = "LRT", reduced = ~1).
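A minimal sketch of that global likelihood ratio test on simulated data; the column name `sample_type`, the spiked-in transcripts, and the 0.05 cutoff are illustrative assumptions. The full design `~ sample_type` is compared against the intercept-only model `~ 1`, so a small p-value means the transcript differs in at least one sample type, without saying which one.

```r
library(DESeq2)

# Simulated counts: 500 transcripts x 15 samples (5 sample types x 3 replicates).
set.seed(1)
sample_type <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
mu <- matrix(2^runif(500, 2, 10), nrow = 500, ncol = 15)
mu[1:50, sample_type == "E"] <- mu[1:50, sample_type == "E"] * 8  # spiked in type E
counts <- matrix(rnbinom(500 * 15, mu = mu, size = 2), nrow = 500,
                 dimnames = list(paste0("tx", 1:500), paste0(sample_type, 1:3)))

coldata <- data.frame(sample_type = sample_type, row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ sample_type)

# Likelihood ratio test: full model (~ sample_type) versus reduced model (~ 1).
dds <- DESeq(dds, test = "LRT", reduced = ~ 1, quiet = TRUE)
res_lrt <- results(dds)

# Transcripts whose expression varies across sample types (adjusted p < 0.05).
sig <- rownames(res_lrt)[which(res_lrt$padj < 0.05)]
cat(length(sig), "transcripts pass the global test\n")
```

With the LRT, the p-values refer to the whole factor while the reported log2 fold change is for a single coefficient, so the transcripts that pass would still need pairwise contrasts (or the grouped comparison described above) to see where the difference lies.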

Hope that helps and good luck!
