I have a single text file containing the amino acid sequences of ~6000 proteins in FASTA format. All proteins belong to a single species, but come from different strains. I want to determine the COG (www.ncbi.nlm.nih.gov/COG) of each protein. I see two possible approaches:

1. The first one is simple: run BLASTp with my query sequences against the COG database (pogseqs.fa) available at:

ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/thousandgenomespogs/blastdb/

and then take the best hit for each query and see which POG group that hit belongs to. But what e-value cutoff should I use? And which criterion should I use to select the best hit: bit score or percent identity?

2. The second approach is more complex: install the COG software (COGsoft.201204.tar) on a Linux system from the following link:

ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/

and then follow the instructions provided in the Readme file (attached). To summarize: first run an all-against-all PSI-BLAST, then convert the results into a format acceptable to COGnitor using the different modules of the COG software. Every step in this approach takes time.
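
To make the best-hit step of the first approach concrete, here is a minimal sketch in Python. It assumes the BLASTp results were written in the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `best_hits`, the `1e-5` cutoff, and the sequence IDs in the sample are placeholders for illustration only, not recommended values, and the selection criterion is left as a parameter because that is exactly the open question.

```python
import csv

def best_hits(lines, max_evalue=1e-5, key="bitscore"):
    """Pick one hit per query from BLAST tabular (-outfmt 6) lines.

    max_evalue is an assumed placeholder cutoff; key is either
    "bitscore" or "pident". Returns {qseqid: (sseqid, pident, evalue, bitscore)}.
    """
    if key not in ("bitscore", "pident"):
        raise ValueError("key must be 'bitscore' or 'pident'")
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # discard hits weaker than the cutoff
        score = bitscore if key == "bitscore" else pident
        # Keep the first hit seen when scores tie.
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, sseqid, pident, evalue, bitscore)
    return {q: v[1:] for q, v in best.items()}

# Hypothetical sample rows; the IDs are made up for illustration.
SAMPLE = [
    "\t".join(["protA", "pog_seq_1", "45.0", "200", "110", "0", "1", "200", "1", "200", "1e-30", "150.0"]),
    "\t".join(["protA", "pog_seq_2", "90.0", "30", "3", "0", "5", "34", "10", "39", "1e-6", "60.0"]),
    "\t".join(["protB", "pog_seq_3", "30.0", "100", "70", "0", "1", "100", "1", "100", "0.5", "25.0"]),
]

by_bitscore = best_hits(SAMPLE, key="bitscore")
by_identity = best_hits(SAMPLE, key="pident")
print(by_bitscore["protA"][0], by_identity["protA"][0])
```

In this made-up sample the two criteria disagree for protA: the bit score favors the long, moderately similar alignment (pog_seq_1), while percent identity favors the short, highly similar one (pog_seq_2). protB gets no assignment because its only hit fails the cutoff. Mapping the chosen subject ID to its group would still need the membership table shipped with the database, which is not covered here.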

Need suggestions.

Am I choosing the right database/tool to determine the orthologous group of each gene?
