Where should I start to analyze DEGs from CLC-produced total count, RPKM, TPM and CPM values?

23 October 2019

Dear Researchers,

I'm a veterinarian who has ended up in the bioinformatics world, as I am analyzing RNA-Seq data from colon samples of 2 groups of dogs. The control group is 3 beagles, from which I sent 3 colonic samples with histopathology of normal colon mucosa. The disease group is 3 dogs, from each of which I sent 1 diseased and 1 normal colonic sample, with histopathology of diseased and normal colon mucosa respectively. Basically I have 3 replicates for each of 3 conditions (normal colon from a normal dog, diseased colon from a diseased dog, normal colon from a diseased dog), making 9 samples in total, which warrants the use of edgeR since I have fewer than 12 samples.

The colonic RNA samples were sent to another DNA research lab for RNA sequencing, and I honestly am not sure which protocol they used; the information I received was that the sequences were processed using CLC Genomics Workbench 12 to produce the Excel file with total count, RPKM, TPM and CPM.

The machine used was an Illumina HiSeq 2500 with a single-read, 50 bp library, sequenced to 15 million reads, from which the FASTQ files were retrieved.

The RNA sequencing workflow was as follows:

1. From the FASTQ reads, the Illumina adapter was removed

2. From the FASTQ reads, the unreliable bases were removed

3. The reads were mapped to the CanFam reference

4. RPKM was calculated.
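To make clear how the normalized columns in such a spreadsheet relate to the total count, here is a minimal sketch of the textbook CPM, RPKM and TPM formulas. The gene counts and lengths are made-up illustrative numbers, and CLC Genomics Workbench may differ in details (for example, which reads count toward the library total or how gene length is defined), so treat this as an illustration and not as CLC's exact method.

```python
# Textbook definitions of CPM, RPKM and TPM for a single sample.
# All input numbers below are hypothetical, for illustration only.

def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million reads in the library."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r / rate_sum * 1e6 for r in rates]

counts = [100, 300, 600]         # hypothetical raw read counts for 3 genes
lengths_bp = [1000, 2000, 3000]  # hypothetical gene lengths in base pairs

print(cpm(counts))
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

All three are derived from the raw counts by dividing out library size (and gene length for RPKM and TPM), which is why they are no longer whole numbers and why the raw count column is the one to check first.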

I am now in a dilemma over whether I should re-process the raw FASTQ files myself or just use the current Excel file, as I do not have access to supercomputers at my facility.

I am trying to analyze the Excel output in RStudio with the edgeR package for normalization and, hopefully, DGE analysis, but I'm quite unsure what stage my current Excel data is at. Will I need to normalize it, or should I not be using the Excel file at all and use the raw FASTQ for edgeR instead (the user manual says not to use FPKM in edgeR, but the raw data)? Will the total count and total read numbers suffice as raw data for edgeR?

I have up to 20,000 genes to analyze, so I wonder whether I have enough time to start from the raw data.

Thanks everyone!

Alex G Lee

Generally I always recommend realigning the reads yourself so that you know exactly what goes in and out. However, you could use the datasheet they gave you if, and only if, the "total count" you mentioned is raw whole-number counts. Without going into too much detail, the models in edgeR, DESeq, and even limma generally expect whole numbers (think discrete vs. continuous), because RNA-Seq counts reads in whole numbers. Thus you cannot use any type of normalized values such as TPM or CPM. If those total counts are indeed raw discrete counts, then I recommend using edgeR with TMM normalization. If you are more familiar with limma, you can then put the data through limma voom and run your linear model. Does this make sense?
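One quick way to check the "raw whole-number counts" condition is to export the spreadsheet to CSV and test each column. The sketch below assumes hypothetical column names ("Gene", "Total count", "TPM") and uses a tiny inline table in place of the real export, so adjust the names to whatever the actual file contains.

```python
import csv
import io

def looks_like_raw_counts(values, tol=1e-9):
    """True if every value is non-negative and integer-valued."""
    return all(v >= 0 and abs(v - round(v)) <= tol for v in values)

# Hypothetical stand-in for the exported spreadsheet; the column names
# and numbers are assumptions for illustration only.
example_csv = """Gene,Total count,TPM
GENE_A,120,35.72
GENE_B,0,0.0
GENE_C,3507,812.4
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
total_count = [float(r["Total count"]) for r in rows]
tpm_values = [float(r["TPM"]) for r in rows]

print("Total count is whole numbers:", looks_like_raw_counts(total_count))
print("TPM is whole numbers:", looks_like_raw_counts(tpm_values))
```

If the count column passes this check for every sample, it is a plausible candidate for the raw input; a column with fractional values has most likely been normalized or otherwise transformed.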

Yong Bin Teoh

Alex G Lee, thanks for the reply and suggestion! My supervisor told me something pretty similar, so we ended up using a commercial program called StrandNGS to perform the alignment and run the edgeR script within the program to get our results.
