How to find the duplicated sequences in a database?

Brian Thomas Foley

BLAST will not be a good tool for this. BLAST scores are based on both percentage identity and length of the matched sequence. You don't specify what type of database you are searching (your own, built from cDNAs of the alternatively spliced mRNAs from one species; or GenBank, EMBL or other public databases covering many species) or what your search query sequence is (the gene with introns, the longest of all the alternatively spliced mRNAs, etc.).

Mazhar Hussain

Well, these are cDNA sequences (spliced) in FASTA format from NCBI and Ensembl.

I am guessing that maybe you are doing dozens of BLAST queries, each with a differently spliced form as the search query, and then from the "hit" results you want a set of unique "hits" with the duplicates removed. Are they all from the same species? Or are you interested, for example, in finding out that humans have spliced forms A, B and C while gorillas have forms A and C but not B?

Nice question again... No sir, they are all from the human...

Sebastian Jünemann

So you have FASTA files (the extracted splice variants) and want to rule out duplicates?

This can easily be done using either various tool-kits or clustering methods. As far as I know, the EMBOSS package contains such a tool (skipredundant), as do the FASTX-toolkit, PRINSEQ, and many more. You can also consider clustering approaches like UCLUST or CD-HIT, which can be configured so that they cluster only sequences that share 100% identity and are also similar in length (they will also output one representative for each such cluster and provide information about the level of redundancy, i.e. a mapping file).

This task is very common in sequence analysis. So searching for terms like 'deduplication', 'dereplication', or 'collapse' in combination with FASTA will give you hundreds of hits, either to ready-to-use tool-kits or to customized solutions (based on various scripting libraries like Biopython, BioPerl, ...).
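As a rough illustration of what such a "collapse" script does, here is a minimal sketch in plain Python (standard library only). It assumes the input is ordinary FASTA text and treats two records as duplicates only if their sequences are identical after upper-casing. The function names `parse_fasta` and `collapse_duplicates` and the example records are made up for illustration and do not come from any of the tools mentioned above.

```python
from collections import OrderedDict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_duplicates(records):
    """Group records whose sequences are identical (case-insensitive).

    Returns an ordered mapping: sequence -> list of headers sharing it.
    The first header in each list serves as the representative.
    """
    groups = OrderedDict()
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return groups


# Hypothetical example data: variant_A_copy duplicates variant_A.
example = """>variant_A
ATGGCCAAG
TTGA
>variant_B
ATGGCCTTGA
>variant_A_copy
atggccaagttga
"""

groups = collapse_duplicates(parse_fasta(example))
for seq, headers in groups.items():
    # Representative ID, number of copies, and all member IDs.
    print(headers[0], len(headers), ",".join(headers))
```

The mapping from representative to members plays the same role as the mapping file mentioned above: it records which IDs were collapsed into each unique sequence.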

Mina Bashir

You can also run unique.seqs() in MOTHUR. You will end up with a FASTA file containing only the unique sequences and a names file telling you how often each sequence occurs.

The command is: unique.seqs(fasta=FILE.fasta)

A problem with something like what Mina Bashir is suggesting is that when you BLAST with a human gene (say 50 kb with 17 exons, interspersed with introns, that may or may not all show up in each message), the BLAST result will contain several "hits" for each mRNA/cDNA sequence in the database. The match to each exon will download as a separate sequence in the output. So, assuming there are no SNPs or different alleles and all the different messages arise by alternative splicing, a list of 70 different alternatively spliced variants should come out as just the 17 exons when run through a tool like UNIQUE.SEQS.

Also, a tool like UNIQUE.SEQS will count sequences as different if they differ in the length of the poly-A tail or in other ways that are not due to alternative splicing.
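One way to reduce the poly-A problem is to normalise the sequences before comparing them. The sketch below is only an illustration under stated assumptions: the helper name `strip_poly_a` and the threshold of 10 consecutive A residues are arbitrary choices, not settings taken from any of the tools discussed here.

```python
import re


def strip_poly_a(seq, min_run=10):
    """Remove a trailing run of at least `min_run` A residues.

    Note: any genuine A residues immediately upstream of the tail are
    removed as well, because they are indistinguishable from the tail.
    """
    return re.sub(r"A{%d,}$" % min_run, "", seq.upper())


# Two hypothetical copies of one message with different tail lengths.
short_tail = "ATGGCCTTGAC" + "A" * 15
long_tail = "ATGGCCTTGAC" + "A" * 30

# After trimming, both compare as identical.
print(strip_poly_a(short_tail) == strip_poly_a(long_tail))
```

This only handles the tail. Other differences that are not due to splicing, such as SNPs or variable 5' ends, would still make the sequences count as distinct in an exact-match comparison.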

It sounds like what you really want is to align each mRNA to the complete gene and then see how many different patterns there are, i.e. how many different ways humans splice this gene.

There are some settings in BLAST and the BLAST download options so that you get the entire "hit" sequence in one entry rather than broken into the exons that matched your gene. But you will then have to align each of them to your gene.
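The idea of aligning each mRNA to the gene and counting the distinct patterns can be sketched in Python as follows. This assumes the gene was used as the BLAST query and the results were saved in the default 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), so that each row is one exon-sized match in gene coordinates. The function names and the example rows are hypothetical. The sketch compares only the internal exon junctions, so SNPs and differences at the transcript ends do not split a group. It does not handle overlapping matches or boundaries that shift by a base or two, where a splice-aware aligner may give more consistent results.

```python
from collections import defaultdict


def splice_signatures(tabular_text):
    """Map each transcript ID to its tuple of exon junctions.

    Each junction is (end of upstream exon, start of downstream exon)
    in gene (query) coordinates.
    """
    blocks_by_transcript = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        sseqid, qstart, qend = fields[1], int(fields[6]), int(fields[7])
        blocks_by_transcript[sseqid].append((qstart, qend))

    signatures = {}
    for sseqid, blocks in blocks_by_transcript.items():
        blocks.sort()  # order the exon matches along the gene
        signatures[sseqid] = tuple(
            (left_end, right_start)
            for (_, left_end), (right_start, _) in zip(blocks, blocks[1:])
        )
    return signatures


def group_by_pattern(signatures):
    """Collect transcript IDs that share the same junction signature."""
    groups = defaultdict(list)
    for sseqid, signature in signatures.items():
        groups[signature].append(sseqid)
    return dict(groups)


# Hypothetical rows: mRNA_2 skips the middle exon; mRNA_3 has a SNP
# and a shorter 5' end but the same junctions as mRNA_1.
rows = [
    ("gene", "mRNA_1", 100.0, 100, 0, 0, 1, 100, 1, 100, 0.0, 180),
    ("gene", "mRNA_1", 100.0, 100, 0, 0, 501, 600, 101, 200, 0.0, 180),
    ("gene", "mRNA_1", 100.0, 100, 0, 0, 901, 1000, 201, 300, 0.0, 180),
    ("gene", "mRNA_2", 100.0, 100, 0, 0, 1, 100, 1, 100, 0.0, 180),
    ("gene", "mRNA_2", 100.0, 100, 0, 0, 901, 1000, 101, 200, 0.0, 180),
    ("gene", "mRNA_3", 100.0, 80, 0, 0, 21, 100, 1, 80, 0.0, 140),
    ("gene", "mRNA_3", 99.0, 100, 1, 0, 501, 600, 81, 180, 0.0, 170),
    ("gene", "mRNA_3", 100.0, 100, 0, 0, 901, 1000, 181, 280, 0.0, 180),
]
example = "\n".join("\t".join(str(value) for value in row) for row in rows)

groups = group_by_pattern(splice_signatures(example))
for signature, members in groups.items():
    print(signature, members)
```

Each distinct signature corresponds to one splicing pattern, so the number of keys in `groups` is the number of different ways the gene is spliced among the transcripts examined.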

Thank you everyone for your responses... Over the last week, I have been trying various options, and some of them have the potential to get me unique hits. But as Brian mentioned, uniqueness could be at the level of exon-intron structure or of SNPs. In my case, I want to ignore the SNPs and just look at splice variants. I tried multiple sequence alignments to find the unique patterns, but the task seemed too big to do manually, and MSA becomes really challenging when you have a lot of alternatively spliced products of all sorts of sizes. In that case, it generates unnecessary gaps or misaligns the sequences.

So, kindly let me know what kind of tools or BLAST settings I should use to get these results.

Regards

Prashanth N Suravajhala

Dear Mazhar

Very interesting suggestions above. I did face a similar problem, though with hypothetical protein sequences as the query. I have always trusted Virginia FASTA over traditional BLAST because the former is highly sensitive, and you have the option of using FASTX Clipper to check for redundant sequences. Please give it a try: fasta.bioch.virginia.edu/

Hope this helps

Prash

Kandasamy Ulaganathan

Dear Hussain

You can identify duplicate sequences easily in BioEdit. Open a new alignment first. Copy all the sequences and import them into BioEdit. Go to the "Sequence" menu, select sort, and group top-down by identical sequences. It will group identical sequences together, which you can then remove easily.

Sincerely

K.ULAGANATHAN

Thank you everyone,

and thanks lately to Prashanth and Kandasamy for the nice suggestions. I am trying the ideas suggested by all of you. I am very sure that one or more of them will suit my needs. Thank you very much again.

Yours Sincerely,