Can anyone help with cleaning of EST sequences?

In my experience, VecScreen regularly reports "no significant similarity", but when I checked those sequences by hand, looking for similarity to the vector ends and the primers used, often 10-15 bp were from the vector.

The sequence of any vector contamination should theoretically be identical to the known sequence of the vector. In practice, occasional differences are expected to arise from sequencing errors, and less frequently, from engineered variants or spontaneous mutations.

VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence.
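To make the terminal rule concrete, here is a minimal sketch of how a match could be classified. It assumes 1-based, inclusive match coordinates, and the exact boundary handling (whether position 25 itself counts) is my assumption, not something taken from the VecScreen documentation. The function name and parameters are hypothetical.

```python
def classify_match(match_start, match_end, query_length, terminal_window=25):
    """Classify a vector match as 'terminal' or 'internal'.

    Coordinates are assumed to be 1-based and inclusive.
    terminal_window=25 follows the rule quoted above; how the boundary
    itself is treated is an assumption of this sketch.
    """
    starts_near_beginning = match_start <= terminal_window
    stops_near_end = match_end > query_length - terminal_window
    if starts_near_beginning or stops_near_end:
        return "terminal"
    return "internal"
```

For a 500 bp query, a match covering positions 10-40 would count as terminal, while one covering 100-150 would count as internal.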

Since checking sequences "by hand" is not feasible in your case, requiring that the ends of each sequence be trimmed is a reliable approach.
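If you want to automate the manual check for short vector remnants at the start of a read, something like the following sketch may help. It assumes you know the vector sequence immediately upstream of the cloning site (called `vector_flank` here, a hypothetical name) and it only finds exact matches, so a single sequencing error in the remnant will defeat it. The 10 bp minimum is taken from the 10-15 bp range mentioned in the question.

```python
def vector_prefix_length(read, vector_flank, min_len=10):
    """Return how many leading bases of `read` exactly match the end of `vector_flank`.

    `vector_flank` is assumed to be the vector sequence directly upstream of
    the cloning site. Returns 0 if the overlap is shorter than `min_len`.
    """
    read = read.upper()
    vector_flank = vector_flank.upper()
    longest = min(len(read), len(vector_flank))
    # Try the longest possible overlap first, then shrink.
    for k in range(longest, min_len - 1, -1):
        if read.startswith(vector_flank[-k:]):
            return k
    return 0


def trim_vector_prefix(read, vector_flank, min_len=10):
    """Return `read` with any exact vector remnant removed from its start."""
    return read[vector_prefix_length(read, vector_flank, min_len):]
```

The same idea applies to the other end of the read by reverse-complementing the read first. For real data, an alignment that tolerates mismatches would be more robust than exact matching.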

Jeffrey Schermer

1. mRNA is extracted from a sample, and a primer-linker is attached. Since mRNAs contain a poly-A tail (a string of As at the end), the primer contains a string of Ts that will hybridize to the poly-A tail, and the short double-stranded end is ready to initiate the duplication to cDNA.

2. Reverse transcriptase makes a reverse-complemented cDNA copy of the mRNA. This process will go on for some length, but not necessarily to the end of the mRNA transcript - the resulting cDNA strand will thus often start some distance from the beginning of the mRNA.

3. The RNA is removed, and the cDNA is methylated to make it more stable.

4. Polymerase duplicates the cDNA strand, making it double-stranded.

5. Adapters are attached to each end of the double-stranded DNA segment. These are short sequences that are designed to match one of the vector’s (see below) cloning sites, fixing the DNA segment as an insert in the vector.

6. The primer-linker contains a target for a restriction enzyme. This enzyme now chops off part of the primer-linker, discarding the adapter at that end, and revealing the sequence that will ligate (bind) to the other cloning site in the vector.

7. The vector is a short, circular DNA sequence (plasmid or phagemid) that will work like a small chromosome when inserted into a suitable host, typically a bacterium like E. coli. Our DNA segment is now ligated to the vector at both ends, and the whole thing is duplicated by growing a bacterial colony.

At this point, there have been a number of opportunities for mishaps. The primer-linker can ligate to something other than a poly-A tail (1); reverse transcription can be - and often is - cut short (2); cDNA can be incompletely methylated, making it a possible target for restriction enzymes that will chop it up (3); adapters can ligate with each other, forming chimeric sequences (5); the primer-linker can avoid being cut, leading to it being retained in the sequence (6); and the bacteria can assimilate two vectors, or none, the insert can end up in the wrong direction, or a vector can possibly acquire an insert from bacterial DNA (7). There are probably more.
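Because the primer targets the poly-A tail, a run of As at the end of a read (or a run of Ts at the start, when the read comes from the opposite strand) is one simple signal to look for. The sketch below only measures such uninterrupted runs; the function names and the 10-base minimum are assumptions for illustration, and real reads with sequencing errors inside the tail would need a more tolerant approach.

```python
def trailing_poly_a_length(seq, min_len=10):
    """Return the length of an uninterrupted run of As at the end of `seq`.

    Returns 0 if the run is shorter than `min_len` (an arbitrary threshold).
    """
    seq = seq.upper()
    run = len(seq) - len(seq.rstrip("A"))
    return run if run >= min_len else 0


def leading_poly_t_length(seq, min_len=10):
    """Return the length of an uninterrupted run of Ts at the start of `seq`.

    Returns 0 if the run is shorter than `min_len` (an arbitrary threshold).
    """
    seq = seq.upper()
    run = len(seq) - len(seq.lstrip("T"))
    return run if run >= min_len else 0
```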

Sequences and consequences

What this means is that, in addition to actual mRNA sequence, we will usually get sequences containing vector, and quite often primer, linker, adapter, or E. coli sequence as well. These need to be identified and discarded (masked or trimmed off) before further analysis.

I’ve been experimenting with two programs: SeqClean, by TIGR, and Lucy, by Chou and Holmes (2001).

SeqClean

Pros: Masks against a database of vectors and a database of contaminants. Masks low complexity.

Cons: Does not take into account sequence quality. Inherent model expects vector in the first and second third of the sequence, sometimes failing to trim vector, and sometimes trimming valid sequence. Trims sequences instead of marking them, making it harder to compare results.

Lucy

Pros: Uses knowledge of vector splice sites, giving it the opportunity for more accurate masking and differentiation between vector and spurious matches. Takes quality into account.

Cons: Very picky about vector input; in particular, splice site sequences must be in the correct orientation and sequence. A bug makes it fail on long FASTA headers (although I wrote a fix for that). Can only deal with a single vector at a time, meaning you need to know, for each sequence, which vector was used.
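For anyone who cannot patch the program, one possible workaround for the long-header problem is to shorten the headers before running it. This is a sketch of that idea, not the fix mentioned above: it keeps only the first word of each header and caps its length. The `max_len` value of 40 is a placeholder, since the length at which the failure occurs is not stated here, and truncated identifiers may collide, so keep a mapping back to the originals if you need it.

```python
def shorten_fasta_headers(lines, max_len=40):
    """Yield FASTA lines with each header reduced to its first word.

    The identifier is additionally cut to `max_len` characters.
    `max_len` is a placeholder value, not a documented limit of any tool.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split(None, 1)
            identifier = words[0] if words else ""
            yield ">" + identifier[:max_len]
        else:
            yield line
```

It can be applied to an open file handle, for example `for out in shorten_fasta_headers(open("ests.fasta")): print(out)`, where `ests.fasta` is a hypothetical file name.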

I’ve noticed that even after masking with either program, some vector remains. And, it turns out, I’m not the only one: my buddies at CBU have had the same problem, and there’s also a paper (Chen et al., 2006). Unfortunately, the remedy suggested there didn’t work for our sequences, and in fact, only seems to help for one particular vector type - that we don’t use.

Currently, I use a rather ugly hack, using RBR to mask, first against E. coli contamination with high stringency, and then against vector, using more lenient settings. Finally, I use SeqClean to chop off the masked-out bits, as RBR only Xes out the offending parts, but doesn’t trim, and TGICL doesn’t deal gracefully with masked but untrimmed sequences…but that’s another bug.
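The trimming step can be illustrated with a small sketch that keeps only the longest stretch free of X characters. This is one simple rule chosen for illustration, not a description of what SeqClean actually does, and the function name is hypothetical.

```python
import re


def longest_unmasked_segment(masked_seq, mask_char="X"):
    """Return (start, end, segment) for the longest run without `mask_char`.

    Coordinates are 0-based with an exclusive end. If the whole sequence is
    masked (or empty), (0, 0, "") is returned.
    """
    pattern = re.compile("[^" + re.escape(mask_char) + "]+")
    best = (0, 0, "")
    for match in pattern.finditer(masked_seq):
        if match.end() - match.start() > len(best[2]):
            best = (match.start(), match.end(), match.group())
    return best
```

Keeping the coordinates along with the trimmed sequence makes it easier to compare the results of different masking runs afterwards.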
