I am working on finding functional signatures/remains of regulatory regions, RNAs and proteins in the intergenic sequences (IGs) of E. coli K-12 MG1655. I have a few doubts about how to demarcate the IGs and whether to include regulatory regions (especially promoters) in them. Could you please help me clarify these two points:
1. For my study, I extracted IGs from the genome using the annotation provided by NCBI as well as EcoCyc, which gave a total of 3718 IG sequences. The extraction was done by removing the annotated gene regions from the genome; whatever was left was treated as IGs. Is this the right approach? If not, what else should I do to obtain the IGs?
2. My aim is to determine whether the IGs, or their translations, show any sequence or structural similarity to known regulatory RNAs or peptides.
My question is: should I remove the known regulatory regions of E. coli (available in RegulonDB) from the sequences I extracted as IGs before going ahead with the search for functional signatures/remains in the intergenic regions?
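To make point 1 concrete, the extraction amounts to something like the following. This is only a minimal sketch and not my exact script: the function name `intergenic_intervals` is hypothetical, coordinates are assumed to be 1-based and inclusive, genes on both strands are pooled, the genome is treated as linear (the wrap-around at the origin of the circular chromosome is ignored), and the coordinates in the example are made up rather than taken from the MG1655 annotation.

```python
def intergenic_intervals(genes, genome_length):
    """Return the gaps left after removing gene intervals from a linear genome.

    genes: iterable of (start, end) pairs, 1-based inclusive, strands pooled.
    """
    gaps = []
    cursor = 1  # first position not yet covered by any gene
    for start, end in sorted(genes):
        if start > cursor:
            # Uncovered stretch between the previous gene(s) and this one
            gaps.append((cursor, start - 1))
        # Overlapping or nested genes only push the cursor forward
        cursor = max(cursor, end + 1)
    if cursor <= genome_length:
        gaps.append((cursor, genome_length))
    return gaps


# Toy example with made-up coordinates (assumption, not real annotation)
toy_genes = [(10, 50), (40, 80), (100, 120)]
print(intergenic_intervals(toy_genes, 150))
```

Overlapping genes are merged implicitly, so a region counts as an IG only if no annotated gene on either strand covers it.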
My doubt about whether to retain or remove the known regulatory regions from the IGs comes from the following observation:
Regulatory regions have often been detected inside regions annotated as genes in the E. coli genome, i.e., promoters/terminators within a gene. Therefore, the presence of a regulatory region within a sequence does not rule out finding a functional region in another reading frame of that sequence, as is observed in the gene regions.
Should I keep the regulatory regions and continue my analysis of functional signatures/remains in the IGs, or should I remove them? If I have to remove them, what explanation should I give to justify the removal?
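If removal turns out to be the right choice, I assume it would amount to subtracting the regulatory-site intervals from each IG, roughly as in the sketch below. The function name `subtract_intervals` is hypothetical, the coordinates are made up, and I am assuming the RegulonDB sites have already been converted to the same 1-based inclusive coordinate system as the IGs.

```python
def subtract_intervals(region, sites):
    """Return the parts of `region` not covered by any interval in `sites`.

    region: (start, end) of one IG, 1-based inclusive.
    sites:  iterable of (start, end) regulatory intervals, same coordinates.
    """
    start, end = region
    pieces = []
    cursor = start  # first position of the IG not yet examined
    for s, e in sorted(sites):
        if e < cursor or s > end:
            continue  # site lies entirely outside what remains of the IG
        if s > cursor:
            pieces.append((cursor, s - 1))
        cursor = max(cursor, e + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces


# Toy example with made-up coordinates (assumption, not RegulonDB data)
print(subtract_intervals((81, 99), [(85, 90), (95, 105)]))
```

Note that removal fragments an IG into shorter pieces, which is part of why I am unsure whether removing the regulatory regions is justified.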