How to force BLAST to generate results for entire genes?

29 December 2022

Hi, I'm developing an R package to cluster multiple genes obtained from multiple genomes (all vs. all genomes), and I'm using blast-2.2.29. My problem concerns genes that differ in an internal region. For example, the rtxA gene (Vibrio vulnificus) has 4 variants. Two of them, rtxA-C and rtxA-M, are almost identical: the M type is ~14103 bp, ~1518 bp shorter than the C type (~15621 bp). The difference is due to the absence of one domain in the M type. An alignment of the two should therefore be nearly complete except for one big gap (of 1518 bp) corresponding to the domain (PMT C1/C2) that is absent in the M type and present in the C type (all the variants of the rtxA gene are shown in figure 1 here: https://www.pnas.org/doi/full/10.1073/pnas.1014339108).

The problem is that when I use BLAST, the reported alignment runs from nucleotide 1 to ~9482, just before the gap, and not over the entire gene. As a result, the alignment values such as percent identity, length and query coverage (pident, length, qcovs), among others, do not represent the full alignment between the two variants of the gene, only part of it (1-9482). This problem does not occur when BLAST detects other genes of the same variant, I mean all of the C type, which have almost the same length (15621 bp) with 10-30% differences (SNPs) among them. In that case the alignment values are generated from the full gene (~15600 bp) and not from a partial match.

So I was wondering whether BLAST has an option to force a full alignment between 2 genes instead of a partial match.

This is my code:

blastn -db DB.fasta -query mygenome_genes.fasta -out result.txt -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send qlen slen evalue bitscore qcovs"

and some of the results obtained (just part of them):

    qseqid          sseqid          pident  length  qcovs
1   IMEHDJCA_00194  DIBHEKPI_00264  100.00  15621   100
2   IMEHDJCA_00194  IMEHDJCA_00194  100.00  15621   100
3   IMEHDJCA_00194  MDIJOING_00449   97.49  15621   100
21  IMEHDJCA_00194  CPLOIJDP_00135   97.22   9482    87

The problem is with the gene with id CPLOIJDP_00135 (last row): only 9482 of its ~14103 nucleotides were used, which generates more differences than expected. I just want BLAST to use the entire gene length in the comparison. In the list, the first 3 ids in sseqid are C type (DIBHEKPI_00264, IMEHDJCA_00194, MDIJOING_00449, 15621 bp) and the last one is M type (CPLOIJDP_00135, 14103 bp).

Any suggestions? Thanks.

Abhijeet Singh

Check the options under "*** Restrict search or results" in the blastn help; you can filter the results based on whatever criteria you want.

Loubna Youssar

As Abhijeet Singh pointed out, you can always change the BLAST parameters. Now the question is: does it make sense to change the parameters to force sequences to be included in the multiple alignment? It will give a worse p-value.

Brian Thomas Foley

Loubna Youssar, yes. In this case the gap is a single deletion (or insertion) of about 1518 bases in some of the species. The flanking regions on either side are highly conserved. It would not be right to treat this as 1518 separate mutation events; it is a one-time insertion or deletion event, but it is good to align and analyze the rest of the gene on either side of this site.
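One way to analyze both flanks together, sketched here in Python rather than R, is to combine all HSPs that BLAST reports for the same query/subject pair instead of reading only the top row. BLAST is a local aligner, so a large indel usually splits a hit into separate rows, one per conserved flank. The sketch below assumes the tab-separated column order from the -outfmt string in the question. The gene names (geneC, geneM) and all numbers in TOY_LINES are made up for illustration and are not results from this thread; the function names are hypothetical as well.

```python
from collections import defaultdict

# Column order taken from the -outfmt string used in the question.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen qstart qend "
           "sstart send qlen slen evalue bitscore qcovs").split()


def parse_hsps(lines):
    """Yield one dict per non-empty, tab-separated BLAST tabular line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in ("length", "qstart", "qend", "qlen", "slen"):
            row[key] = int(row[key])
        row["pident"] = float(row["pident"])
        yield row


def merged_length(intervals):
    """Number of positions covered by 1-based inclusive intervals."""
    covered, last_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered


def summarize(hsps):
    """Combine all HSPs of each (query, subject) pair into one summary."""
    groups = defaultdict(list)
    for h in hsps:
        groups[(h["qseqid"], h["sseqid"])].append(h)
    summary = {}
    for pair, rows in groups.items():
        aligned = sum(r["length"] for r in rows)
        # Length-weighted identity over all HSPs of the pair.
        identity = sum(r["pident"] * r["length"] for r in rows) / aligned
        q_intervals = [(min(r["qstart"], r["qend"]), max(r["qstart"], r["qend"]))
                       for r in rows]
        covered = merged_length(q_intervals)
        summary[pair] = {
            "n_hsps": len(rows),
            "aligned_length": aligned,
            "pident": round(identity, 2),
            "query_coverage": round(100.0 * covered / rows[0]["qlen"], 1),
        }
    return summary


# Hypothetical toy input: one pair split into two HSPs around a 100 bp indel.
TOY_LINES = [
    "geneC\tgeneM\t98.00\t600\t12\t0\t1\t600\t1\t600\t1000\t900\t0.0\t1000\t90",
    "geneC\tgeneM\t96.00\t300\t12\t0\t701\t1000\t601\t900\t1000\t900\t0.0\t500\t90",
]

if __name__ == "__main__":
    for pair, stats in summarize(parse_hsps(TOY_LINES)).items():
        print(pair, stats)
```

The summed aligned length can double-count positions if HSPs overlap on the query, so the coverage is computed from merged intervals instead. Whether a length-weighted identity is the right clustering metric is a design choice, not something BLAST prescribes.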

Abraham Guerrero

Thanks so much, all of you. The main problem is that BLAST does not produce a full alignment for a gene that has a big gap, but I solved it with R code: whenever a pair of sequences does not get a full alignment from BLAST, I use an R package to compute the full alignment between the 2 sequences. When I asked I didn't know how to do it, but I have now done it. Thanks so much!
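For readers who want to reproduce this kind of check, here is a minimal sketch of how a full pairwise alignment could be scored once it exists. It is written in Python, not R, and it is not the code used above. It takes two already-aligned, gapped sequences (from any global aligner) and reports identity over the aligned columns, while counting each run of gaps as a single event, in line with the point that one large indel is one mutation event. The function name and the toy sequences are hypothetical.

```python
from itertools import groupby


def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Summarize a pairwise alignment given as two equal-length gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = 0
    states = []  # per column: "M" aligned pair, "A" gap in a, "B" gap in b
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # a column with gaps in both rows carries no information
        if x == gap:
            states.append("A")
        elif y == gap:
            states.append("B")
        else:
            states.append("M")
            if x.upper() == y.upper():
                matches += 1
            else:
                mismatches += 1
    gap_columns = sum(1 for s in states if s != "M")
    # Each maximal run of gap columns in the same sequence counts as one event.
    gap_events = sum(1 for s, _ in groupby(states) if s != "M")
    aligned = matches + mismatches
    return {
        "aligned_columns": aligned,
        "matches": matches,
        "mismatches": mismatches,
        "gap_columns": gap_columns,
        "gap_events": gap_events,
        "pident_aligned": round(100.0 * matches / aligned, 2) if aligned else 0.0,
        "pident_all_columns": round(100.0 * matches / len(states), 2) if states else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical toy alignment: one 6-column deletion and one substitution.
    a = "ACGTACGTTTGGCCAACGT"
    b = "ACGTAC------CCAACGA"
    print(alignment_stats(a, b))
```

In the toy case the identity is much higher over the aligned columns than over all columns, which is the same effect that makes a single large gap look like many differences when every gap column is counted.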
