Hi, I’m trying to develop an R package to clustering multiples genes obtained from multiples genomes (all vs all genomes), and I’m using blast-2.2.29; my problem is that for some genes that present some differences in the internal zone, as example the rtxA gene (Vibrio vulnificus) present 4 variants, 2 of them, the rtxA-C and rtxA-M variant are almost equal, the M type present ~14103 bp, ~1518 bp less than C type (~15621), the differences is due to the absent of one domain in the M type; in the alignment will see almost an entire alignment except a big gap (of 1518 bp) corresponding to the absent domain (PMT C1/C2) in M type that is present C type (this is the link of all the variants of the rtxA gene in the figure 1: https://www.pnas.org/doi/full/10.1073/pnas.1014339108).

The problem is that when I use blast the result of the alignment is presented from the nucleotide 1 to ~9482, just before the gap and not the entire gene, so the values of alignment as the percent of identity, length, query coverage (pident, length, qcovs) among others do not represent the values of the full alignment among both variants of the gene, just part of it (1-9482); this problem is not presented if blast detect others genes from the same variant, I mean all of the C type that presented almost the same length (15621 bp) with 10-30% of differences (SNPs) among them but similar length, in this case the alignment values are generated from the full gene (~15600 bp) and not a partial.

So I was wonder if blast presented an option to force the full aliment among 2 genes and not a partial match !!!!

This is my code:

blastn -db DB.fasta -query mygenome_genes.fasta -out result.txt -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send qlen slen evalue bitscore qcovs"

and the some of the obtained results (just part of it)

qseqid sseqid pident length qcovs

1 IMEHDJCA_00194 DIBHEKPI_00264 100.00 15621 100

2 IMEHDJCA_00194 IMEHDJCA_00194 100.00 15621 100

3 IMEHDJCA_00194 MDIJOING_00449 97.49 15621 100

21 IMEHDJCA_00194 CPLOIJDP_00135 97.22 9482 87

The problem is the gene with id CPLOIJDP_00135 (last) were used just 9482 of the ~14103 nucleotides, it generate more differences that expected; I just want that blast use the entire gene length in the comparison. In the list the first 3 ids in sseqid are C type (DIBHEKPI_00264, IMEHDJCA_00194, MDIJOING_00449, 15621 bp) and the last one is M type (CPLOIJDP_00135, 14103 bp).

Any suggestion, thanks

More Abraham Guerrero's questions See All
Similar questions and discussions