The aim:
To use ORFs predicted by TransDecoder from an assembled transcriptome as a protein sequence database for a proteomic study of a non-model organism (protein identification by mass spectrometry).
The problem:
The resulting database of predicted ORFs contains a significant number of protein sequence fragments that MS identifies as different proteins, but that are actually overlapping parts of one protein. Filtering with CD-HIT reduces redundancy to some extent but cannot fix the problem, because the identity threshold cannot be set below 95-97% if real protein isoforms (cytoplasmic and muscle actins, for example) are to be preserved.
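To make the kind of overlap concrete, here is a minimal Python sketch of what merging two such fragments might look like. It is only an illustration under strong assumptions: it requires an exact suffix-prefix match of at least `min_len` residues, the function names and the toy sequences are hypothetical, and it does nothing to stop fragments of closely related isoforms or paralogs from being joined, which a real solution would have to handle.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 10) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` residues exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_fragments(a: str, b: str, min_len: int = 10) -> Optional[str]:
    """Merge two protein fragments if one contains the other or they overlap end to end.

    Returns the merged sequence, or None if the fragments cannot be joined.
    """
    # One fragment fully contained in the other: keep the longer one.
    if b in a:
        return a
    if a in b:
        return b
    # Try both orientations: a followed by b, then b followed by a.
    k = suffix_prefix_overlap(a, b, min_len)
    if k:
        return a + b[k:]
    k = suffix_prefix_overlap(b, a, min_len)
    if k:
        return b + a[k:]
    return None


# Toy example (made-up data): two fragments cut from one sequence,
# sharing a 15-residue overlap.
full = "MDDDIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEAQSKRGILTLKYPIEHGIV"
frag1 = full[:45]
frag2 = full[30:]

merged = merge_fragments(frag1, frag2)
print(merged == full)
```

The exact-match requirement is deliberately conservative: a single mismatch inside the overlap prevents merging, which is one crude way to avoid collapsing near-identical isoforms, at the cost of missing fragments that differ only by assembly errors.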
The question:
How can I find and assemble these overlapping sequences? Could you suggest a concise algorithm and some tools to solve this problem? Note that a de novo approach would be preferable, since non-model organisms may contain unique sequences not present in other organisms.
Thanks