We're generating proteomics datasets for non-model organisms that show considerable diversity between genera and for which several gene model sets of varying quality are available. Most proteomics manuscripts deal with a single model organism and search peptides against a canonical, well-annotated gene model set (e.g. from UniProt) using Mascot, SEQUEST, etc.

Is there any formal analysis of how combining multiple gene model sets affects peptide/protein matches, particularly when the species of the sample and the species of the gene model set are not identical? In our case, I am considering merging several of the available gene model sets into one large search database that I can use with samples from different species (either from the same genus or from related genera). Are there pitfalls in this approach, assuming one uses a protein-clustering algorithm to collapse redundant sequences? I would expect it to increase the number of proteins identified, either because a more diverse set of models makes a close match to each sample peptide more likely, or because one gene model set fills gaps in another.
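To make the idea concrete, here is a minimal sketch of the kind of merge I have in mind. It is only an illustration under stated assumptions: the function names and the `source|id` header convention are hypothetical, and it collapses exact duplicate sequences only, which is a simple stand-in for a real protein-clustering step that would also merge near-identical sequences.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_gene_models(sources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each unique protein sequence to the source-tagged IDs that carry it.

    Assumption: only exact duplicates are collapsed (after upper-casing and
    removing a trailing '*' stop symbol); this is not similarity clustering.
    """
    merged: Dict[str, List[str]] = {}
    for source_name, lines in sources.items():
        for header, seq in parse_fasta(lines):
            seq = seq.upper().rstrip("*")
            if not seq:
                continue
            fields = header.split()
            protein_id = fields[0] if fields else "unnamed"
            # Hypothetical naming convention: "<source>|<original id>"
            merged.setdefault(seq, []).append(f"{source_name}|{protein_id}")
    return merged


def to_fasta(merged: Dict[str, List[str]]) -> str:
    """Write one record per unique sequence, listing all member IDs in the header."""
    records = [
        f">{ids[0]} members={','.join(ids)}\n{seq}" for seq, ids in merged.items()
    ]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Toy input standing in for two gene model sets
    set_a = [">p1 some description", "MKV", "LLA*", ">p2", "MTTT"]
    set_b = [">x1", "mkvlla", ">x2", "MQQQ"]
    print(to_fasta(merge_gene_models({"setA": set_a, "setB": set_b})))
```

Keeping every member ID in the header means a match can still be traced back to the gene model set(s) it came from, which seems important for judging whether hits come mostly from one set.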
