Is there a way to reduce the number of sequences in BLAST search results that are too similar to each other?

Dipon Das

If you are looking for evolutionarily distant proteins, you can try PSI-BLAST and modify the default settings after each iteration!

Harinder Singh

Exclude those taxa and keep one species in; you can keep the others out of your results.

Andrea Mafficini

Are you already using only the refseqs as database to search?

Gianluca Molla

I suppose not, since I just used the default BLAST setup. As RefSeq is a non-redundant database, it will probably be "cleaner" than the default database (even if it probably still hosts sequences from very closely related organisms).

Again, I wonder if there is a simple cut-off parameter in BLAST to exclude sequences that are too similar from the results.

This guy has some words to spend about it, I guess :)

http://www.youtube.com/watch?v=Ud_6VpX5AgI

I.e., try the RefSeq database and the BLOSUM45 matrix for comparison.

Aditya Upadrasta

Hi Gianluca, as others suggested, you can use reference sequence databases, particularly ones relevant to your sequence. MEGABLAST with a stringent E-value threshold can also reduce the number of sequences. However, paralogs cannot be eliminated completely during BLAST searches. If you are trying to construct a phylogenetic tree, use highly identical sequences, and include sequences from different type strains (because of gene variation within a species). Once you have set the E-value threshold, select only the sequences you are interested in and download their FASTA files.

Good Luck

Cheers

Aditya

Thank you all for your comments. I'll try your suggestions in the coming weeks.

Gianluca

Romain Studer

What you can do is to take all your results with BLAST, and then pass your unaligned sequences to CD-HIT:

http://www.bioinformatics.org/cd-hit/

Nice tool to reduce redundancy in a bunch of sequences.
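As a rough sketch of how that could look from a script: the Python below only builds (and optionally runs) a cd-hit command line. The file names are made up for illustration, and the word-size table follows the recommendations in the CD-HIT user guide for protein clustering as I remember them, so check them against the version you install.

```python
import shutil
import subprocess


def cd_hit_word_size(identity):
    """Word size (-n) suggested by the CD-HIT user guide for a protein identity threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit is not meant for protein thresholds below 0.4")


def build_cd_hit_command(fasta_in, fasta_out, identity=0.9):
    """Return the argument list for clustering fasta_in at the given identity."""
    return [
        "cd-hit",
        "-i", fasta_in,        # unaligned sequences collected from the BLAST hits
        "-o", fasta_out,       # one representative sequence per cluster
        "-c", str(identity),   # sequence identity threshold (0.9 = 90%)
        "-n", str(cd_hit_word_size(identity)),
    ]


def run_cd_hit(fasta_in, fasta_out, identity=0.9):
    """Run cd-hit if it is on the PATH; raise otherwise."""
    if shutil.which("cd-hit") is None:
        raise RuntimeError("cd-hit executable not found on PATH")
    subprocess.run(build_cd_hit_command(fasta_in, fasta_out, identity), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(build_cd_hit_command("blast_hits.fasta", "blast_hits_nr90.fasta")))
```

Lowering the identity value makes the output set smaller and more diverse; 0.9 is just a common starting point.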

Thomas Meinel

Hi Gianluca,

Isn't it an advantage to have so many alternatives? The question is where the alternatives come from. Are you interested in a particular reference sequence that you do not want to miss? UniProt includes many identical or fragmentary sequences in the database, so it would be a further advantage to reduce the starting set. UniRef (90, 50) would be a solution here. But what you find in a given subset is very inhomogeneous. Take care with this!

In SYSTERS ( http://systers.molgen.mpg.de ) we reduced the full protein sequence set (2003; across all species) from 1.2 million protein sequences to ~600 k sequences by an 80/80 criterion (80% similarity over 80% of the sequence length), giving the so-called non-redundant sequence set. If you run a BLAST there against the SYSTERS consensus database (~150 k clusters), you will find the protein family (with multiple non-redundant sequences, but also with all the redundant sequences). The question is: do you have new sequences whose meaning (functionality) you want to find? Which reference sequence are you looking for? Maybe the optimal strategy is to find the family easily and quickly, and then go into detail with all the similar sequences in that family to find the most applicable feature, as the other people mentioned (species, paralog/ortholog, gene functionality, etc.). So have a look at SYSTERS or any other protein family approach.

Greetings, Thomas

Saurabh Gayali

Hmm. If I am right, you are talking about getting fewer than 10 similar hits per query in NCBI BLAST (10 being the minimum option); in that case, try BLAST2GO.

Héctor Valverde

Have you tried PSI-BLAST?

Daniel Clark

I just noticed this post which may provide you with some helpful methods also:

http://www.researchgate.net/topic/Bioinformatics_and_Computational_Biology/post/Looking_for_a_method_to_filter_out_data_from_related_BLAST_results2

Miguel Andrade

We did a prototype system that does this: http://www.ogic.ca/projects/bluster/

huda adnan

Sorry, I do not understand a lot about molecular and gene sequences.

Brian Thomas Foley

In the BLAST search form, in the Choose Search Set box, there is a line for ORGANISM with a check-box for EXCLUDE next to it. This can be very useful for the problem you describe. Using the Taxonomy browser ( http://www.ncbi.nlm.nih.gov/taxonomy/ ) you can fill in ORGANISM with either the name or the tax-ID number at any level of the taxonomy you wish, and then ask to exclude hits from those taxa.

For example, if I am interested in mammals and I have a Mus musculus gene of interest that has been sequenced from thousands of strains of mice, I can exclude all Mus musculus, or the whole Mus genus, or all rodents, in order to find the gene only in other, more distantly related species. I usually also do a second run, excluding all mammals, to get a marsupial, fish, amphibian or other sequence(s) to use as the outgroup, if I am interested in the mammals.
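The same exclusion can also be written as a text query instead of ticking the EXCLUDE box. The small Python sketch below just builds such a string; I am assuming it is used in the "Entrez Query" field of the web form or passed as the entrez_query argument of Biopython's NCBIWWW.qblast, and the helper name and the tax-IDs in the example (Mus musculus, Mus, Rodentia, Mammalia, as I recall them) should be checked in the Taxonomy browser before use.

```python
def exclusion_query(taxids):
    """Build an Entrez query that excludes every taxon in `taxids` (NCBI tax-ID numbers)."""
    ids = [int(t) for t in taxids]  # fail early on anything that is not a number
    if not ids:
        raise ValueError("give at least one tax-ID to exclude")
    terms = " OR ".join(f"txid{t}[Organism]" for t in ids)
    return f"all[filter] NOT ({terms})"


if __name__ == "__main__":
    # First run: drop the over-sampled species, genus, or order.
    print(exclusion_query([10090]))   # Mus musculus only
    print(exclusion_query([10088]))   # whole Mus genus
    print(exclusion_query([9989]))    # all rodents
    # Second run: drop all mammals to pick up an outgroup.
    print(exclusion_query([40674]))
```

Excluding at a higher taxonomic level removes more near-identical hits in one go, at the cost of also hiding the closest relatives.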

Rogelio Rodríguez-Sotres

You may want to use JALVIEW (http://www.jalview.org/). It is ideal for your problem.

1. Download your sequences raw or in an alignment (many formats are recognized).

2. Load them into Jalview, or use Jalview to download the sequences by their accession numbers (File, Fetch). Many databases are supported.

3. If they are not aligned, use Jalview to request alignments from external services: MAFFT, ProbCons, ClustalX, T-Coffee, and MUSCLE are supported.

4. Use the Edit/Remove redundancy tool to cut down sequences that are too similar to each other. Just select the cut-off value (as you move the bar, the program highlights the sequences to be removed, so you can easily choose the level of similarity to use as the cut-off).

5. Save your file. Again many formats are available & voilà - you are done.
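To make the idea behind step 4 concrete, here is a minimal Python sketch of a greedy identity cut-off on sequences that are already aligned. It is not Jalview's own implementation, just the same kind of filter under my own assumptions: identity is counted over columns where both sequences have a residue, longer sequences are preferred as representatives, and the toy sequences are invented.

```python
def percent_identity(a, b):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def remove_redundancy(aligned, cutoff=90.0):
    """Return the names to keep from `aligned` (dict: name -> aligned sequence).

    A sequence is dropped when it is at least `cutoff` percent identical
    to a sequence that has already been kept.
    """
    # Visit sequences with more residues first so they win over fragments.
    order = sorted(aligned, key=lambda name: -len(aligned[name].replace("-", "")))
    kept = []
    for name in order:
        if all(percent_identity(aligned[name], aligned[k]) < cutoff for k in kept):
            kept.append(name)
    # Report the survivors in their original order.
    return [name for name in aligned if name in kept]


if __name__ == "__main__":
    toy = {
        "mouse_a": "MKTAYIAKQR",
        "mouse_b": "MKTAYIAKQK",  # 90% identical to mouse_a
        "fish":    "MRSGFLAHQR",
    }
    print(remove_redundancy(toy, cutoff=90.0))  # ['mouse_a', 'fish']
```

Moving the cut-off up or down here plays the same role as moving the bar in the Jalview dialog.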

good luck

Try CD-HIT.