I have NGS RNA-seq data and would like to search for RNAs and their expression profiles. I would like to know what the shortest RNA is (shorter than miRNAs) that can be identified as an RNA species.
I don't think anyone can answer this question with the kind of figures you are interested in. The question is also not entirely logical.
Some things about expression analysis. First, expression analysis is not done for RNA but with RNA, to see the expression profile of a gene. Second, transcriptome analysis is based on a reference and alignment. Even for de novo assembly and annotation, some knowledge of the genome and gene structure is necessary.
And what you can expect from RNA-seq data depends heavily on what you already know and what you are looking for.
In your case, what do you expect from reads shorter than 22 nucleotides?
If you have such short reads, then before using the data you need to think about why the reads are so short, and whether it would be logical to use such data for any kind of analysis in the first place.
Or, if you are asking about the detection or discovery of new RNA species, you need advice from an expert in the RNA field.
I second Abhijeet Singh: the answer strongly depends on the system or sample you are looking at. I have had an experiment where we cared about a short abortive transcript of at least 8 nucleotides in bacteria. In that case we knew what we were looking for and used in vitro poly(A) tailing to generate meaningful reads. But this is hardly what you are asking for.
Generally, for transcripts shorter than miRNAs you would need some indication of what type of RNA you are looking for, and then map against those transcripts only. I would be cautious though, since very short sequences in complex organisms can often map to thousands of loci just by chance.
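To put a rough number on the "thousands of loci by chance" point, here is a minimal sketch. It assumes a deliberately simplified model that is not from the posts above: the genome is treated as a random sequence with equal base frequencies, and both strands are searched. The function name expected_chance_hits is made up for illustration, and the genome size is the figure of about 2.5 x 10^9 bp for mouse mentioned in this thread. Real genomes are repetitive and compositionally biased, so treat the output as an order-of-magnitude guide only.

```python
def expected_chance_hits(read_length: int, genome_size: float, strands: int = 2) -> float:
    """Expected number of exact matches of one random read in a random genome.

    Assumes independent, equally likely bases (a simplification; real
    genomes are repetitive and biased in composition).
    """
    # Number of start positions a read of this length can occupy, per strand.
    positions = max(genome_size - read_length + 1, 0) * strands
    # Probability that a given position matches the read exactly.
    return positions / 4 ** read_length


MOUSE_GENOME_BP = 2.5e9  # approximate figure quoted in this thread

for k in (8, 12, 16, 20, 22):
    hits = expected_chance_hits(k, MOUSE_GENOME_BP)
    print(f"{k:>2} nt: ~{hits:,.4f} expected chance hits")
```

Under these assumptions an 8 nt read is expected to match tens of thousands of positions, while a 22 nt read is expected to match far less than one position by chance.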
Statistically speaking, it all comes down to which permutations (with repetition) you will be able to map uniquely to the reference genome/transcriptome. The mouse genome contains about 2.5 x 10^9 bp. This means that to get a unique permutation (like a PCR primer, for example) from A/T/G/C, you need at least 16 bp (4^16 ≈ 4.295 x 10^9; P_k(n) = n^k, here with n = 4 bases and k = 16 positions).
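The arithmetic above can be checked with a short sketch. The function name min_unique_length is hypothetical, and the calculation only finds the smallest k for which the number of possible k-mers (4^k) is at least the genome size. It is a lower bound under a random-sequence assumption, not a guarantee that any particular 16-mer is unique in a real genome.

```python
def min_unique_length(genome_size: int, alphabet_size: int = 4) -> int:
    """Smallest k such that alphabet_size**k >= genome_size.

    This is the counting argument only: below this length there are fewer
    possible sequences than genome positions, so uniqueness cannot be expected.
    """
    k = 1
    while alphabet_size ** k < genome_size:
        k += 1
    return k


MOUSE_GENOME_BP = 2_500_000_000  # approximate figure quoted in this thread

k = min_unique_length(MOUSE_GENOME_BP)
print(f"minimum length: {k} nt ({4 ** k:,} possible sequences)")
```

For the mouse figure this gives 16, since 4^15 is about 1.07 x 10^9 (too few) and 4^16 is about 4.295 x 10^9.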
Thank you, Abhijeet Singh, Clemens Thölken and Michail Yekelchyk, for adding answers. Apologies if my question was not properly written. So, theoretically, the sequence length should be 16 bp or above in the case of mouse to have confidence in the mapping.