I have a single text file containing the amino acid sequences of ~6000 proteins in FASTA format. All proteins belong to a single species but come from different strains. I want to determine the COG (www.ncbi.nlm.nih.gov/COG) of each protein. I see two possible approaches:
1. The first is simple: run BLASTp with my query sequences against the COG database (pogseqs.fa) available at:
ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/thousandgenomespogs/blastdb/
and then take the best hit for each query and see which POG group that hit belongs to. But what e-value cutoff should I use? And which criterion should I use to select the best hit: bit-score or percent identity?
2. The second approach is more complex: install the COG software (COGsoft.201204.tar) on a Linux system from the following link:
ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
and then follow the instructions in the Readme file (attached). In summary: first run an all-against-all PSI-BLAST, then convert the output into a format that COGnitor accepts, using the different modules of the COG software. Every step in this approach takes time.
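To make the first approach concrete, here is a minimal sketch of how I imagine picking the best hit per query. It assumes the BLASTp results were written in the standard 12-column tabular format (`-outfmt 6`), that hits are ranked by bit-score with the lower e-value breaking ties, and that 1e-5 is used as the cutoff. The cutoff, the ranking rule, and the names `Hit` and `best_hits` are placeholders of mine, not recommended values; they are exactly the points I am unsure about.

```python
import csv
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    subject: str      # subject sequence ID (column 2 of -outfmt 6)
    identity: float   # percent identity (column 3)
    evalue: float     # e-value (column 11)
    bitscore: float   # bit-score (column 12)


def best_hits(rows: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest bit-score hit per query from BLAST tabular lines.

    max_evalue is an assumed cutoff, not a recommended value.
    """
    best: Dict[str, Hit] = {}
    for fields in csv.reader(rows, delimiter="\t"):
        # Skip blank, comment, or malformed lines.
        if len(fields) < 12 or fields[0].startswith("#"):
            continue
        query = fields[0]
        hit = Hit(fields[1], float(fields[2]), float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(query)
        # Prefer the higher bit-score; on a tie, prefer the lower e-value.
        if current is None or (hit.bitscore, -hit.evalue) > (current.bitscore, -current.evalue):
            best[query] = hit
    return best


# Made-up example rows, for illustration only.
sample = [
    "q1\tsA\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-40\t150.0",
    "q1\tsB\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.0",
    "q2\tsC\t30.0\t100\t70\t0\t1\t100\t1\t100\t0.5\t25.0",
]
hits = best_hits(sample)
for query, hit in hits.items():
    print(query, hit.subject, hit.bitscore, hit.evalue)
```

In this made-up example q1 is assigned to sA even though sB has the higher percent identity, because sA has the higher bit-score over a longer alignment; q2 gets no assignment because its only hit fails the assumed cutoff. The subject ID of each best hit would then still have to be mapped to its POG group.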
Any suggestions would be appreciated.
Am I choosing the right database/tool to determine the orthologous groups of my genes?