I am not a specialist in genome annotation. I found a good review, "A beginner's guide to eukaryotic genome annotation" by M. Yandell and D. Ence (2012). They mention that repeats can cause a lot of problems during annotation, but after reading through the steps they describe, I still cannot see how the error I am facing could have arisen.
I have found that the annotation I used for my study contains wrong gene coordinates. There is a FASTA database of transcript sequences with gene IDs, and these are genuine gene transcripts (they match proteins). However, in the genome annotation file, some genes whose IDs are identical to the transcript IDs contain hundreds of repeats that do not match those transcripts by sequence.
So when I extracted those 'genes' from the genome scaffolds, they did not match the corresponding transcripts. Instead they matched, for example, 800 other "genes" with different IDs, as well as one retrotransposon.
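To make the check concrete, here is a minimal Python sketch of the kind of comparison I mean. It assumes GFF-style coordinates (1-based, inclusive) and sequences already loaded as plain strings. The function names and the k-mer size of 15 are my own illustrative choices, not part of any annotation pipeline. Because the genomic region includes introns, only k-mers that span exon junctions should be missing, so a correctly placed gene should share most of its transcript's k-mers.

```python
# Hypothetical helpers for a quick sanity check; not taken from any annotation tool.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(scaffold_seq, start, end, strand):
    """Cut a feature out of a scaffold using 1-based inclusive (GFF-style) coordinates."""
    region = scaffold_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


def kmer_set(seq, k):
    """Collect all distinct k-mers of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(transcript, gene_region, k=15):
    """Fraction of the transcript's k-mers that also occur in the gene region."""
    transcript_kmers = kmer_set(transcript, k)
    if not transcript_kmers:
        return 0.0
    return len(transcript_kmers & kmer_set(gene_region, k)) / len(transcript_kmers)
```

A fraction close to 1 would suggest the annotated coordinates really cover the transcript, while a fraction near 0 would reproduce the mismatch described above. This is only a rough screen and does not replace a proper alignment.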
So, if the annotation process includes several rounds of aligning transcripts to the genome scaffolds, including 'polishing', why do some transcripts end up not matching the genomic sequence at their annotated coordinates?
If I only need rough gene start and end coordinates, would it be very laborious to BLAST these transcripts against the genome myself (with short-blast/Megablast) and extract the coordinates? What should I expect?
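For reference, this is roughly what I have in mind for the extraction step, as a minimal Python sketch. It assumes the BLAST results are in the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `rough_gene_spans` and the 95% identity cutoff are my own assumptions. For each transcript it keeps the scaffold with the highest summed bitscore and reports the outermost hit coordinates on it.

```python
from collections import defaultdict


def rough_gene_spans(blast_lines, min_identity=95.0):
    """Map each transcript to (scaffold, start, end, strand) from BLAST tabular hits.

    Hypothetical helper: picks the scaffold with the highest summed bitscore
    and spans all retained hits on it. The identity cutoff is an assumed default.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        if pident >= min_identity:
            hits[(qseqid, sseqid)].append((sstart, send, bitscore))

    best = {}
    for (qseqid, sseqid), hsps in hits.items():
        total = sum(score for _, _, score in hsps)
        if qseqid in best and total <= best[qseqid][0]:
            continue
        start = min(min(s, e) for s, e, _ in hsps)
        end = max(max(s, e) for s, e, _ in hsps)
        # In tabular output, minus-strand hits have sstart > send.
        plus = sum(1 for s, e, _ in hsps if s <= e)
        strand = "+" if plus >= len(hsps) - plus else "-"
        best[qseqid] = (total, sseqid, start, end, strand)
    return {q: v[1:] for q, v in best.items()}
```

One limitation I can already see: if a transcript contains repeat-derived sequence, hits scattered along the same scaffold would inflate the span, so the reported start and end are only a first approximation.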