I am interested in about 100 genes in fish and would like to investigate their expression levels using publicly available RNA-Seq data. My strategy is to build a reference from the sequences of the genes of interest, index it with Bowtie 2, and then align the publicly available RNA-Seq data from SRA (filtered using the SRA Toolkit) against it. The resulting SAM file is then passed to eXpress to estimate the expression level of each gene as an FPKM value.
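To make the strategy concrete, here is a minimal sketch of the command sequence, written as a small Python helper that only assembles the commands. The file names, the accession, and the function name `build_commands` are placeholders for illustration. It assumes single-end reads and that `fastq-dump` writes `<accession>.fastq` into the working directory; the `-a` option (report all alignments) is included because eXpress is designed to resolve multi-mapping reads, but whether that is the right setting here is part of what I am unsure about.

```python
import shlex


def build_commands(reference_fasta, index_prefix, accession, out_dir):
    """Return the pipeline steps as argument lists (nothing is executed here).

    All paths and the accession are hypothetical placeholders.
    Assumes single-end reads; paired-end data would need -1/-2 instead of -U.
    """
    fastq = f"{accession}.fastq"
    sam = f"{accession}.sam"
    return [
        # 1. Index the gene-of-interest reference sequences.
        ["bowtie2-build", reference_fasta, index_prefix],
        # 2. Convert the SRA run to FASTQ with the SRA Toolkit.
        ["fastq-dump", accession],
        # 3. Align reads to the reference, reporting all alignments (-a).
        ["bowtie2", "-a", "-x", index_prefix, "-U", fastq, "-S", sam],
        # 4. Estimate per-target abundance (including FPKM) with eXpress.
        ["express", "-o", out_dir, reference_fasta, sam],
    ]


if __name__ == "__main__":
    for cmd in build_commands("genes.fa", "genes_idx", "SRRXXXXXXX", "express_out"):
        print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to actually run the step.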
I have several questions about this strategy:
Firstly, when building the gene reference, what kind of sequences should I use if no genomic data are available? For example, gene A may have been studied by several groups, and their sequences can be found in the NCBI Nucleotide database but with different lengths. Which one should I choose? Besides, introns may be spliced out during RNA processing. Should I therefore use the sequence before or after splicing (this is important because the length of the gene affects the final FPKM value), and how can I tell whether a given RNA sequence is spliced or not?
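To illustrate why the reference length matters, here is a small sketch using the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The numbers are made up for illustration, and the function name `fpkm` is my own. Note that eXpress reports FPKM based on an effective length rather than the raw sequence length, so this shows the principle only, not its exact calculation.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """FPKM = fragments * 1e9 / (length in bp * total mapped fragments)."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Same fragment count and library size, two candidate reference lengths
    # for the same gene (hypothetical values).
    total = 10_000_000
    for length in (1000, 2000):
        print(length, fpkm(500, length, total))
```

With the same 500 fragments, doubling the reference length from 1000 bp to 2000 bp halves the FPKM, which is why the choice between a shorter and a longer (or spliced and unspliced) sequence for the same gene changes the reported value.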
Secondly, is there any problem with the expression levels estimated using this strategy? Could they be over- or underestimated?
Any other suggestions are very welcome!