Our lab sent rat cardiac tissue for sequencing and has received FASTQ data files that we cannot readily interpret. Is there software I can use to process these FASTQ sequence files and obtain meaningful results?
You need to download your files first. Second, align your reads to the reference genome (I used the Tuxedo suite and edgeR). Third, you need to calculate gene expression and get the differential gene expression (DGE).
What I have done is:
1. Raw RNA-seq reads were mapped to the Mus genome using Bowtie and TopHat. Bowtie stores the reference genome sequence in an FM index, a structure that allows the sequence to be searched rapidly. Using this index, Bowtie aligns reads to the reference genome at a rate of tens of millions of reads per CPU hour. Bowtie can only align reads without large gaps, so it cannot align reads that span introns. TopHat is another aligner, and it is used to find transcript splice sites. TopHat aligns reads to the reference genome using Bowtie as its core alignment engine. TopHat breaks reads that have large gaps into smaller pieces called segments so that they can be aligned to the genome. When several segments of a read align to the genome between 100 bp and several hundred kilobases from one another, TopHat infers that the read spans a splice junction and estimates where the splice sites are. Polymorphisms can be identified from the mismatches, insertions, and deletions in the alignment. Aligned reads can also be used to quantify gene and transcript expression, since the number of reads from a transcript is proportional to its abundance.
2. The files that resulted from Bowtie and TopHat were then run through Cuffdiff. Cuffdiff calculates gene expression and tests the statistical significance of observed changes in expression between two or more samples. Cuffdiff assumes that the number of reads from a transcript is proportional to its abundance. Cuffdiff accepts multiple replicates per condition. Cuffdiff output files contain gene expression changes (fold change on a log2 scale), P values (raw and corrected for multiple testing), gene names, and gene locations in the genome.
3. Cuffdiff output files were then run through CummeRbund. CummeRbund loads Cuffdiff data into the R statistical environment, where the expression data can be clustered and plotted.
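To make the two statistics mentioned in step 2 more concrete, here is a minimal Python sketch of a log2 fold change and of a multiple-testing correction. It is only an illustration and not Cuffdiff's own code: the function names, the pseudocount, and the example numbers are my assumptions, and Benjamini-Hochberg is used here simply as one common correction method.

```python
import math


def log2_fold_change(expr_a, expr_b, pseudocount=1.0):
    """Return log2(B / A) for two expression values.

    The pseudocount is an assumption of this sketch; it avoids division
    by zero when a gene is not expressed in one condition.
    """
    return math.log2((expr_b + pseudocount) / (expr_a + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(log2_fold_change(3.0, 7.0))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A gene with a large absolute log2 fold change and a small corrected P value is the kind of candidate you would then inspect further in CummeRbund.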
There are two main approaches to using RNA-Seq data. Depending on your research question you could do
(1) a transcriptome assembly (e.g. via Trinity) or
(2) a read mapping against a reference genome sequence (e.g. via STAR, HISAT2).
However, this is just a very brief description of the two most frequently applied approaches. Working with tools for NGS data takes some time, and interpreting the results requires some knowledge of the methods. If you are not familiar with the aforementioned tools, you should seek help from a bioinformatics core facility.