Hi there,

I would like to share with you a problem that I commonly encounter on NCBI's BLAST web server:

I would like to recover the homologous nucleotide sequences of a gene of interest (presumed single copy) within a microbial taxon (most often within a genus).

As a query, I use a protein sequence rather than a nucleotide sequence: if I used a DNA sequence from one species of the genus studied, I would risk missing the homologous genes in the most distant species within the genus. So I run a TBLASTN.

When I run this TBLASTN on the NCBI BLAST server, I adjust the parameters to restrict the search to the taxon studied and to recover as many sequences as possible (I set the maximum number of target sequences to 1000 or even 5000, depending on the situation). I also use a fairly stringent e-value threshold, chosen after preliminary tests, which ranges from 1e-50 to 1e-100 or even lower.

As the database, I search either the nr/nt nucleotide collection or the WGS database, depending on the situation. I mention this in case anyone asks, but it makes little difference to the problem I describe.

The results displayed by the server look right most of the time: I get the list of strains that have been fully sequenced to date and in which the algorithm has found a sequence homologous to my sequence of interest.

However, when I download the FASTA file of the aligned sequences, I get far more sequences than were displayed. My tests show that this number depends on the stringency chosen in the parameters. For example, if I keep the default e-value threshold (0.05), I can recover over 100,000 sequences, sometimes with more than 100 sequences per accession!

To begin with, if you know a trick for recovering only the most similar sequences, I'm interested. But my main problem is this:

In the downloaded FASTA file, the sequences are not ordered by e-value (that is, by similarity to the query) but by their position in the genome (or contig). Had they been ordered by e-value, a few clicks would have been enough to keep only the first, best-scoring sequence for each accession; as it is, sorting them by hand is too laborious.
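To make the sorting step concrete, here is a minimal Python sketch of the "keep only the best hit per accession" idea. It does not work on the FASTA file itself; it assumes the hits are also available as a tab-separated hit table whose first twelve columns follow the standard BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, e-value, bit score), with optional comment lines starting with `#`. The function name `best_hit_per_subject` and the accessions in the sample are hypothetical, for illustration only.

```python
def best_hit_per_subject(table_text):
    """Keep, for each subject accession, the alignment with the highest bit score.

    Assumes tab-separated rows in the standard 12-column BLAST tabular layout;
    extra trailing columns and lines starting with '#' are ignored.
    """
    best = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        hit = {
            "subject": fields[1],
            "sstart": int(fields[8]),    # start of the aligned region on the subject
            "send": int(fields[9]),      # end of the aligned region on the subject
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        }
        current = best.get(hit["subject"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["subject"]] = hit
    return best


# Hypothetical sample: two alignments on ACC_A.1, one on ACC_B.1.
SAMPLE = (
    "# Fields: query, subject, % identity, ...\n"
    "query1\tACC_A.1\t85.0\t300\t45\t0\t1\t300\t1000\t1899\t1e-150\t520\n"
    "query1\tACC_A.1\t40.0\t120\t72\t2\t10\t129\t50000\t50359\t2e-20\t95.5\n"
    "query1\tACC_B.1\t78.0\t298\t65\t1\t1\t298\t7200\t6307\t3e-130\t455\n"
)

best = best_hit_per_subject(SAMPLE)
for subject, hit in sorted(best.items()):
    print(subject, hit["sstart"], hit["send"], hit["evalue"], hit["bitscore"])
```

The subject coordinates kept for each accession could then be used to pick the matching record out of the FASTA file, or to extract the region from the genome directly. Ranking by bit score rather than by e-value is a choice made here because very strong hits often all report an e-value of 0.0 and cannot be told apart that way.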

Have you faced this problem and, if so, how did you solve it (if you did)?

Thank you for your attention.
