Using DNA Barcodes to Identify and Classify Living Things: Bioinformatics
I. Use BLAST to Find DNA Sequences in Databases (Electronic PCR)
Perform a BLAST search as follows:
Do an Internet search for "ncbi blast."
Click the link for the result BLAST: Basic Local Alignment Search Tool. This will take you to the Internet site of the National Center for Biotechnology Information (NCBI).
Under the heading "Basic BLAST," click "nucleotide blast."
Enter the primer set you used into the search window. These are the query sequences.
The following primers were used in this experiment:
Plant rbcL gene
rbcLa f 5’- ATGTCACCACAAACAGAGACTAAAGC-3’ (forward primer)
Omit any non-nucleotide characters from the window because they will not be recognized by the BLAST algorithm.
Under "Choose Search Set," select "NCBI Genomes (chromosome)" from the pull-down menu.
Under "Program Selection," optimize for "Somewhat similar sequences (blastn)."
Click "BLAST". This sends your query sequences to a server at the National Center for Biotechnology Information in Bethesda, Maryland. There, the BLAST algorithm will attempt to match the primer sequences to the DNA sequences stored in its database. A temporary page showing the status of your search will be displayed until your results are available. This may take only a few seconds or more than a minute if many other searches are queued at the server.
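The instruction above to omit non-nucleotide characters can be followed by hand, but a short script makes it harder to miss a stray dash or digit. The following Python sketch is only an illustration: the helper name clean_primer is hypothetical, and it assumes the standard IUPAC nucleotide alphabet rather than anything specific to how NCBI parses its query box.

```python
# Hypothetical helper (not part of BLAST): keep only IUPAC nucleotide codes.
# Pass only the sequence portion, e.g. "5'- ATG...-3'". A label such as
# "rbcLa f" contains letters that are also valid nucleotide codes and would
# not be removed.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def clean_primer(raw: str) -> str:
    """Return the primer with direction markers, dashes, and spaces removed."""
    return "".join(ch for ch in raw.upper() if ch in IUPAC_NUCLEOTIDES)


forward = clean_primer("5'- ATGTCACCACAAACAGAGACTAAAGC-3'")
print(forward)  # ATGTCACCACAAACAGAGACTAAAGC
```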
The results of the BLAST search are displayed in three ways as you scroll down the page:
First, a "Graphic Summary" illustrates how significant matches, or "hits," align with the query sequence. Why are some alignments longer than others?
This is followed by "Descriptions of sequences producing significant alignments," a table with links to database reports.
The accession number is a unique identifier given to a sequence when it is submitted to a database, such as GenBank. The accession link leads to a detailed report on the sequence.
Note the scores in the "E value" column on the right. The Expectation or E value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E value, the higher the probability that the hit is related to the query. For example, an E value of 1 means that a search with your sequence would be expected to turn up one match by chance.
What is the E value of your most significant hit, and what does it mean? What does it mean if there are multiple hits with similar E values?
What do the descriptions of significant hits have in common?
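To make the E value described above more concrete, the sketch below uses the standard relationship between a bit score S' and the expected number of chance alignments, E = m × n × 2^(-S'), where m is the query length and n is the total database length. This is a simplification: BLAST itself uses "effective" lengths that correct for edge effects, and the lengths and scores in the example are made-up values chosen only for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths here; real BLAST reports
    use effective lengths, so reported E values will differ somewhat.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 26-nt primer searched against 1 billion nucleotides.
for score in (30, 40, 52):
    print(score, expect_value(score, query_len=26, db_len=1_000_000_000))
```

Each additional bit of score halves the E value, which is why long, near-perfect alignments report E values that round to zero while short primer matches do not.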
Next is an "Alignments" section, which provides a detailed view of each primer sequence ("Query") aligned to the nucleotide sequence of the search hit ("Subject"). Notice that hits have matches to one or both of the primers.
Predict the length of the product that the primer set would amplify in a PCR reaction (in vitro).
In the "Alignments" section, select a hit that matches both primer sequences.
Which nucleotide positions do the primers match in the subject sequence?
The lowest and highest nucleotide positions in the subject sequence indicate the borders of the amplified sequence. Subtracting one from the other gives the difference between the coordinates.
However, the PCR product includes both ends, so add 1 nucleotide to the result that you obtained in Step 3.c. to determine the exact length of the fragment amplified by the two primers.
What value do you get if you calculate the fragment size for other species that have matches to the forward and reverse primer? Do you get the same number?
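The fragment-length arithmetic described above (highest coordinate minus lowest coordinate, plus 1) can be written as a small function. This is a minimal sketch: the function name and the coordinates in the example are hypothetical, and it assumes both primer hits lie on the same linear subject sequence.

```python
def amplicon_length(forward_hit, reverse_hit):
    """Length of the PCR product spanned by two primer alignments.

    Each hit is a (start, end) pair of subject coordinates as shown in the
    BLAST "Alignments" section. The order within a pair does not matter,
    because a primer may align to either strand.
    """
    positions = [*forward_hit, *reverse_hit]
    # Both end positions are part of the product, hence the +1.
    return max(positions) - min(positions) + 1


# Hypothetical coordinates for illustration only.
print(amplicon_length((54958, 54983), (55556, 55537)))  # 599
```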
Determine the type of DNA sequence amplified by the primer set:
Click the accession link (beginning with "ref") to open the data sheet for the hit used in Question 3 above.
The data sheet has three parts:
The top section contains basic information about the sequence, including its basepair (bp) length, database accession number, source, and references to papers in which the sequence is published.
The bottom section lists the nucleotide sequence.
The middle section contains annotations of gene and regulatory "FEATURES," with their beginning and ending nucleotide positions ("xx..xx"). These features may include genes, coding sequences (cds), regulatory regions, ribosomal RNA (rRNA), and transfer RNA (tRNA).
Identify the feature(s) located between the nucleotide positions identified by the primers, as determined in 3.b. above.
II. Determine Sequence Relationships Using the Blue Line
The following directions explain how to use the Blue Line of DNA Subway to analyze novel DNA sequences generated by a DNA sequencing experiment. If you did not sequence your own DNA sample, you can follow these directions to use DNA sequences produced for other students. You can find supplementary instructions by clicking on the "manual" link on the DNA Subway homepage.
DNA Subway is an intuitive interface for analyzing DNA barcodes. Generally, you progress in a stepwise fashion through the button "stops" on each "branch line." An "R" indicates that analysis is available. A blinking "R" indicates an analysis is in process. A "V" means that results are ready to view.
You can analyze relationships between DNA sequences by comparing them to a set of sequences you have compiled yourself, or by comparing your sequences to others that have been published in databases such as GenBank (National Center for Biotechnology Information). Generating a phylogenetic tree from DNA sequences derived from related species can also allow you to draw inferences about how these species may be related. By sequencing variable sections of DNA (barcode regions) you can also use the Blue Line to help you identify an unknown species, or publish a DNA barcode for a species you have identified but which is not represented in published databases like GenBank (www.ncbi.nlm.nih.gov/genbank).
Create a DNA Subway Project and Upload DNA Sequences
Log into DNA Subway at www.dnasubway.org. If you do not have an account, you will need to register first to save and share your work.
Select "Determine Sequence Relationships" (Blue Line) to begin a project.
Select "rbcL" or "COI" from the "Select Project Type" section. (rbcL (plant) sequences must be analyzed separately from COI (animal) sequences.) If you are analyzing a barcode region that is not listed, select "DNA."
"Select Sequence Source" provides several ways to obtain sequences for barcode analysis:
Upload sequence(s) in ab1 (files ending with .ab1) or FASTA format. Click "Browse" to navigate to a folder on your desktop or drive containing your sequence(s). Select a sequence by clicking on its file name. Select more than one sequence by holding down the ctrl key while clicking file names. Once you have selected the sequences you want, click "Open".
Enter a sequence in FASTA format. Below is an example of this format. The ">" symbol demarcates the sequence name. The sequence is started on the next line.
>sequence name
atcgccccttaatattgcctt…
Import a sequence/trace from the DNALC. Click your tracking number. Select one or more files from the list. Click to "Add" selected files.
Select a sample sequence.
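The FASTA layout shown above (a ">" name line followed by one or more sequence lines) is simple enough to read with a few lines of code. The parser below is a minimal sketch for checking your own files before upload; it is not the code DNA Subway uses, and the record names in the example are invented.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {sequence name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # text after ">" is the sequence name
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = ">sample1\natcgcccc\nttaatatt\n>sample2\nggtacc\n"
print(parse_fasta(example))
# {'sample1': 'atcgccccttaatatt', 'sample2': 'ggtacc'}
```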
Provide a title in the "Name Your Project" section.
Write a short description of your project in the "Description" section (optional).
Click "Continue" to load the project into DNA Subway.
View and Build Sequences
There are many plants, animals, and fungi which do not have a documented barcode sequence. For instance, there are an estimated 350,000 species of angiosperms (flowering plants), but as of June 2013 there were only about 74,000 rbcL angiosperm sequences in GenBank. For other species, diversity in the barcode sequences is not well characterized. This means that there are opportunities to submit novel sequences and contribute to the global barcoding effort. Only samples that have high quality sequence for both the forward and reverse reads are good enough to ensure a low error rate and can be published to GenBank, so the sequence quality must be checked. Sequences for which there is only one high quality read are not considered high enough quality to publish. These sequences, and those with no high quality sequence, can still be analyzed even though the results are not of publishing quality.
On the "Assemble Sequences" branch line, click "Sequence Viewer" to display the sequences you have input in the project creation section. If you did not upload trace files, you can scroll to see the sequence. If you uploaded trace files, click on the file names to view the trace files.
The DNA sequencing software measures the fluorescence emitted in each of four channels – A,T,C,G – and records these as a trace, or electropherogram. In a good sequencing reaction, the nucleotide at a given position will be fluorescently labeled far in excess of background (random) labeling of the other three nucleotides, producing a "peak" at that position in the trace. Thus, peaks in the electropherogram correlate to nucleotide positions in the DNA sequence.
A software program called Phred analyzes the sequence file and "calls" a nucleotide (A, T, C, G) for each peak. If two or more nucleotides have relatively strong signals at the same position, the software calls an "N" for an undetermined nucleotide.
Phred also examines the peaks around each call and assigns a quality score for each nucleotide. The quality score corresponds to a logarithmic error probability that the nucleotide call is wrong, or, conversely, to the accuracy of the call.
Phred Score    Error           Accuracy
10             1 in 10         90%
20             1 in 100        99%
30             1 in 1,000      99.9%
40             1 in 10,000     99.99%
50             1 in 100,000    99.999%
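The logarithmic relationship in the table above follows the standard Phred definition, Q = -10 × log10(P), where P is the probability that the base call is wrong. The short sketch below converts in both directions; the function names are chosen here for illustration.

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (10, 30, 50):
    p = phred_to_error(q)
    print(f"Q{q}: 1 error in {round(1 / p):,} calls")
```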
The electropherogram viewer represents each Phred score as a blue bar. The horizontal line equals a Phred score of 20, which is generally the cut-off for high-quality sequence. Thus any bar at or above the line is considered a high-quality read. What is the error rate and accuracy associated with a Phred score of 20?
Every sequence "read" begins with nucleotides (A,T,C,G) interspersed with Ns. In "clean" sequences, where experimental conditions were near optimal, the initial Ns will end within the first 25 nucleotides. The remaining sequence will have very few, if any, internal Ns. Then, at the end of the read the sequence will abruptly change over to Ns.
Large numbers of Ns scattered throughout the sequence indicate poor quality sequence. Sequences with average Phred scores below 20 will be flagged with a "Low Quality Score Alert." You will need to be careful when drawing conclusions from analyses made with poor quality sequence. What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled "N"?
Note: The exclamation icon (!) indicates poor quality sequence.
Use the “X” and “Y” buttons to adjust the level of zoom. You can undo zooming by pressing the “Reset” button.
Examine the quality of the sequence(s). Any sequence for which the forward or reverse read has the warning icon indicating a low quality score is not of good enough quality to publish, and any determination of novelty will be tentative, as sequencing errors could appear to be novel polymorphisms.
Click “Sequence Trimmer” to trim your sequences; this automatically removes Ns from the 5’ and 3’ ends of selected sequences. Click again to view the trimmed sequences. Why is it important to remove excess Ns from the ends of the sequences?
If you wish to view trimmed sequences, click on the file name.
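The idea behind end trimming can be illustrated with a deliberately simple sketch that strips runs of N from both ends of a read. This is an assumption-laden stand-in, not the Sequence Trimmer's actual algorithm, which may use different rules (for example, also removing short stretches that contain scattered Ns).

```python
def trim_terminal_ns(seq: str) -> str:
    """Remove uninterrupted runs of N (or n) from the 5' and 3' ends only.

    Internal Ns, and Ns that follow the first called base, are left alone.
    """
    return seq.strip("Nn")


read = "NNNNATGTCACCANCAAACAGAGNNNNNN"
print(trim_terminal_ns(read))  # ATGTCACCANCAAACAGAG
```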
Pair and Build Consensus for Forward and Reverse Reads
If you have two reads for a sample, pair the sequences by checking the box to the right of each read for the sample. By default, DNA Subway assumes that all reads are in the forward orientation, and displays an "F" to the right of the sequence. If any sequence is not in that orientation, click the "F" to reverse complement the sequence. The sequence will display an "R" to indicate the change.
After checking the second read, a dialogue box will appear asking if you wish to designate the sequences as a pair. Alternatively, click "Try auto pairing" to pair sequences that have identical sample names with F or R appended to indicate the sequencing direction.
Click "Save" to save your pair assignments.
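Reverse complementing, as applied to the reverse read above, means swapping each base for its partner (A with T, C with G) and then reading the result backwards. The sketch below handles A, C, G, T, and N only; other ambiguity codes are passed through unchanged, which is a simplification made here for brevity.

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
print(reverse_complement("AACN"))  # NGTT
```

Applying the function twice returns the original sequence, which is a quick sanity check when you are unsure whether a read has already been flipped.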
Once you have created sequence pairs, click “Consensus Editor” to make a consensus sequence from both sequences in the selected pairs. To examine the consensus sequence click “Consensus Editor” again, and then click on the link to the pair you wish to examine. How does the consensus sequence optimize the amount of sequence information available for analysis? Why does this occur?
If there are any mismatched nucleotides between the first and second sequence, these will be highlighted yellow in the consensus editor window. Do differences tend to occur in certain areas of the sequence? Why?
Large numbers of yellow mismatches – especially in long blocks – may indicate that you have incorrectly paired sequences from two different sources (organisms), or that you failed to reverse complement the reverse strand.
Return to Pair Builder to check your pairs and reverse complements.
Click the red "X" to redo a pairing, and toggle "F" and "R" settings, as needed.
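The yellow highlighting described above amounts to comparing the two reads position by position once they are in the same orientation. The sketch below assumes the two reads have already been aligned to equal length, which the Consensus Editor handles for you; the function name and example reads are hypothetical.

```python
def mismatch_positions(read_a: str, read_b: str) -> list:
    """1-based positions at which two equal-length, aligned reads differ."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must be aligned to the same length")
    pairs = zip(read_a.upper(), read_b.upper())
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


forward = "ATGTCACCACAAACAG"
reverse = "ATGTCACCTCAAACAG"  # already reverse complemented; differs at one site
print(mismatch_positions(forward, reverse))  # [9]
```

A handful of isolated positions is typical of good reads, whereas mismatches at most positions suggest a pairing or orientation problem of the kind described above.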
A large number of mismatches in properly paired and reverse complemented sequences indicates that one or both sequences are of poor quality. Often, one of the sequencing reactions produces a high quality read that can be used on its own. To determine this:
Examine the distribution of Ns to see if they are mainly confined to one of the two sequences.
Examine the electropherograms to see if one of the two sequences is of good quality.
If one of the sequences seems of good quality, return to Pair Builder, and click the red X to undo the pairing.
Continue on to Step 4.
Few or no internal mismatches indicate good quality sequence from forward and reverse reads. If you like, you can check the consensus sequence at yellow mismatches and override the judgment made by the software:
Click a highlighted mismatch to see the electropherograms and Phred scores for each read.
Click the desired nucleotide in the black rectangle to change the consensus sequence at that position. You should only change the consensus if you have a strong reason to believe the consensus is wrong.
Click the button to Save Change(s).
BLAST Your Sequence
A BLAST search can quickly identify any close matches to your sequence in sequence databases. In this way, you can often quickly identify an unknown sample to the genus or species level. It also provides a means to add samples for a phylogenetic analysis.
On the Add Sequences branch, click "BLASTN". Then, click the "BLAST" button next to the sequence you want to query against DNA databases.
The returned list has information about the 20 most significant alignments (hits):
Accession number, a unique identifier given to each sequence submitted to a database. Prefixes indicate the database name – including gb (GenBank), emb (European Molecular Biology Laboratory), and dbj (DNA Databank of Japan).
Organism and sequence description or gene name of the hit. Click the genus and species name for a link to an image of the organism, with additional links to detailed descriptions at Wikipedia and Encyclopedia of Life (EOL).
Several statistics allow comparison of hits across different searches. The number of mismatches over the length of the alignment gives a rough idea of how closely two sequences match. The bit score formula takes into account gaps in the sequence; the higher the score the better the alignment. The Expectation or E-value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E-value, the higher the probability that the hit is related to the query. For example, an E-value of 0 means that a search with your sequence would be expected to turn up no matches by chance. Why do the most significant hits typically have E-values of 0? (This is not the case with BLAST searches with primers.) What does it mean when there are multiple BLAST hits with similar E-values?
Examine the last column in the report called “Mismatches.” For barcodes, this is the informative column, with the best hits being those with the lowest number of mismatches. Note that hits with low numbers of mismatches can sometimes be lower on the list, as the bit scores are used to arrange the hits in the table. High bit scores can occur when the alignment length is longer, even when there are more mismatches than for other hits.
If there are zero mismatches between your sequence and a BLAST result, it is unlikely that your sequence is unique. Instead, the identical sequences probably match because they are in the same taxonomic group as your sample. Check to see if the matching sequences are from species that seem reasonable for your sample. If your best matches include some mismatches, you may have identified a novel barcode. The more mismatches you find, the more likely that your sequence is unique, especially in regions of the sequence with high quality scores. However, sequencing errors could explain the difference, so it will be important to reexamine the trace files at any sites with mismatches to ensure that the consensus at those locations is of high quality.
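The point above, that the hit with the fewest mismatches is not always at the top of a list sorted by bit score, can be seen by re-sorting a results table. The hits below are invented values for illustration, and the field names are assumptions rather than DNA Subway's own data format.

```python
# Hypothetical BLAST hits, listed in bit-score order as the table would show.
hits = [
    {"name": "hit_A", "bit_score": 1050, "aln_length": 600, "mismatches": 3},
    {"name": "hit_B", "bit_score": 1010, "aln_length": 550, "mismatches": 0},
    {"name": "hit_C", "bit_score": 990, "aln_length": 545, "mismatches": 1},
]


def rank_by_mismatches(blast_hits):
    """Sort hits by fewest mismatches, breaking ties by higher bit score."""
    return sorted(blast_hits, key=lambda h: (h["mismatches"], -h["bit_score"]))


print([h["name"] for h in rank_by_mismatches(hits)])
# ['hit_B', 'hit_C', 'hit_A']
```

Here hit_A has the highest bit score only because its alignment is longer, yet it is the least similar of the three by mismatch count.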
Add BLAST sequence data to your phylogenetic analysis by checking the box(es) above any accession number(s), then clicking on "Add BLAST hits to project" at the bottom of the BLAST results window.
Add Sequences to Your Analysis
Click “Upload Data” to add additional sequence data to your analysis without starting a new project. Use “Upload Sequence(s)” to upload AB1 trace files or FASTA formatted sequences stored locally on your computer; use “Enter Sequence(s)” to paste or type sequences in FASTA format.
If you would like to import sequences from non-local sources you can use “Import Sequence” to search a sequence database using a sequence identifier. For GenBank sequences you can search by identification number (GI or Version). Search BOLD by species name, or search the DNALC sequence database by tracking number for sequences you processed with GENEWIZ through the DNALC system.
If your sequence had no hits with zero mismatches, you may use NCBI BLAST to confirm that the sequence is novel. Click on the BLASTN button and then double-click on the sequence (the actual nucleotides) that you identified as possibly novel to select them. Right-click (PC) or command-click (Mac) and then select copy to move the sequence to your clipboard.
In a web browser go to http://blast.ncbi.nlm.nih.gov. From this page click on "nucleotide blast."
Paste your sequence into the "Enter Query Sequence" window under "enter accession number(s), gi(s), or FASTA sequence(s)."
Under "Program Selection" select "Highly similar sequences (megablast)"; next click "BLAST."
On the results page you will get a list of results very similar to what was returned by DNA Subway.
Scrolling down the page, you will find alignments of your sequence (Query) to the sequences from the closest matches in GenBank (Sbjct).
Analyze the results of the BLAST search, which are displayed in three ways as you scroll down the page:
First, a graphical overview illustrates how significant matches (hits) align with the query sequence. Matches of differing lengths are indicated by color-coded bars. For barcoding results, it is likely that most matches will be red, indicating high scores, and cover most of the width of the table, showing matches that span the length of your query sequence.
This is followed by a table with "Descriptions of sequences producing significant alignments” much like the table for BLAST results in DNA Subway.
Next is an "Alignments" section, which provides a detailed view of your sequence ("Query") aligned to the nucleotide sequence of the search hit ("Sbjct," or subject).
From the table, identify any matches that are 100% identical or any matches with high identity that appear to represent species or sequences you have not identified previously. Select these sequences by clicking on the box to the left of each hit. After selecting sequences, click Download, ensure FASTA (complete sequence) is selected, and then click Continue.
Open the resulting FASTA file (named seqdump). Double-click the sequences to select them all, then right-click (PC) or command-click (Mac) and select copy to move the sequence to your clipboard. Add these sequences to your project using the Upload Data function, as in step 1.
Click on "Sequence Viewer" back on DNA Subway, and view the trace file for the forward read of your query sequence. Locate the position on your table where the query sequence differed from the GenBank match. Determine if the nucleotides you identified as different were of high quality (i.e., not sequencing errors). Because of sequence trimming, you may have to search for the polymorphic site, as the numbers from the BLAST alignment and in the trace file may not correspond.
You may also choose to search for your sequence at the International Barcode of Life (IBOL) database, BOLD (Barcode of Life Online Database); their records are not all in GenBank.
Click on the BLAST button and then double-click on the nucleotides for the sequence you are analyzing. Right-click (PC) or command-click (Mac) and then select copy to move the sequence to your clipboard.
In a web browser go to http://boldsystems.org. From this page, click on “Identification.”
Select the tab that corresponds to the appropriate kingdom for the sample (animal, plant, or fungal).
Select “Species Level Barcode Records.”
Paste the sequence into the search box labeled “Enter sequences in fasta format”; next click “Submit.”
Again, a results table is produced. The column labeled “similarity” indicates how similar your sequence was to the records in BOLD, with a 100% match indicating an exact match. Some records in BOLD are not public, or are not accompanied by species-level identifications. Scrolling down the list of matches you will see a pairwise alignment of your sequence (Query) to the matched sequences (Subj). Once again, identify any new hits that may be identical to your sequence. For published hits, you can download the sequence by clicking the link to the right of “Published,” then clicking “FASTA” and saving the file. This FASTA file can be uploaded, as described above, at step 1.
Click “Reference Data” (optional) to include additional sequences. Depending on the project type you have created, you will have access to additional sequence data that may be of interest. For example, if you are doing a DNA barcoding project using the rbcL gene, samples of rbcL sequence from major plant groups (Angiosperms, Gymnosperms, etc.) will be provided. Choose any data set to add it to your analysis; you will be able to include or exclude individual sequences within the set in the next step.
Analyze Sequences: Select and Align
Many unknown species can be rapidly identified by a BLAST search. In this case, a phylogenetic analysis adds depth to your understanding by showing how your sequence fits into a broader taxonomy of living things. If your BLAST search fails to identify your sequence, phylogenetic analysis can usually identify it to at least the family level.
Click “Select Data” to display all the sequences you have brought into your analysis, including “user data,” BLAST hits, or reference data. Check off sequences you wish to include in an alignment. In general, to determine the relationship of your sequence to species with known barcodes, it is best to concentrate on similar sequences. For instance, you should align sequences from samples that you believe are the same species and any close matches from database searches. You may also use the “Select all” feature to include all sequences; to deselect all sequences, click “Select all” twice. You may run new alignments or download different sequences at any time after selecting a new set of sequences.
To download selected sequences to a FASTA file click the “Download” button and save the resulting file.
To save your selections, click “Save Selections” in the blue dialog box that appears when you make any selections.
Click “MUSCLE” to run the MUSCLE multiple alignment software. This software will align all sequences that were included in the “Select Data” step. Click “MUSCLE” again to open the created multiple alignment. An alignment that is suitable for creating a phylogenetic tree will have an overall high consensus score (represented by the height of the black bars on the lower portion of the alignment window).
Scroll through your alignments to see similarities between sequences. Nucleotides are color coded, and each row of nucleotides is the sequence of a single organism or sequencing reaction. Columns are matches (or mismatches) at a single nucleotide position across all sequences. Dashes (-) are gaps in sequence, where nucleotides in one sequence are not represented in other sequences.
Note that the 5’ (leftmost) and 3’ (rightmost) ends of the sequences are usually misaligned, due to gaps (-) or undetermined nucleotides (Ns). What causes these problems?
Note any sequence that introduces large, internal gaps (-----) in the alignment. This is either poor quality or unrelated sequence that should be excluded from the analysis. To remove it, return to Select Data, uncheck that sequence, and save your change. Then click "MUSCLE" to recalculate.
You will need to “trim” the alignment. To trim, click the “Trim Alignment” button on the upper-left of the Alignment Viewer. Why is it important to remove sequence gaps and unaligned ends?
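One way to picture alignment trimming is to cut the alignment back to the first and last columns in which every sequence has a called base. The sketch below implements that rule as an assumption for illustration; the "Trim Alignment" button may apply different criteria, and the short alignment in the example is invented.

```python
def trim_alignment(rows):
    """Trim ragged ends from a multiple alignment.

    rows: list of equal-length aligned sequences (gaps written as '-').
    Keeps the span between the first and last columns in which no sequence
    has a gap or an N. Internal gaps inside that span are retained.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned rows must have the same length")

    def is_clean(col):
        return all(r[col] not in "-Nn" for r in rows)

    clean_cols = [c for c in range(width) if is_clean(c)]
    if not clean_cols:
        return ["" for _ in rows]
    start, end = clean_cols[0], clean_cols[-1] + 1
    return [r[start:end] for r in rows]


alignment = ["--ATGCCA-", "NAATGC-AT", "-AATGCCAT"]
print(trim_alignment(alignment))  # ['ATGCCA', 'ATGC-A', 'ATGCCA']
```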
Analyze Sequences: Create a Phylogenetic Tree
Click "PHYLIP ML" to generate a phylogenetic tree using the maximum likelihood method. A tree will open in a new window, and the MUSCLE alignment used to produce it will open in another window.
A phylogenetic tree is a graphical representation of relationships between taxonomic groups. In this experiment, a gene tree is determined by analyzing the similarities and differences in DNA sequence.
Look at your tree.
The branch tips are the DNA sequences of individual species or samples you analyzed. Any two branches are connected to each other by a node (n), which represents the common ancestor of the two sequences.
The length of each branch is a measure of the evolutionary distance from the ancestral sequence at the node. Species or sequences with short branches from a node are closely related; those with longer branches are more distantly related.
A group formed by a common ancestor and its descendants is called a clade. Related clades, in turn, are connected by nodes to make larger, less-closely related clades.
Click a node (n) to highlight sequences in that clade. Click the node again to deselect the clade. What assumptions are made when one infers evolutionary relationships from sequence differences?
Generally, the clades will follow established phylogenetic relationships ascending from genus > family > order > class > phylum. However, gene and phylogenetic trees do disagree on some placements, and much research is focused on "reconciling" these differences. Why do gene and phylogenetic trees sometimes disagree?
Find and evaluate your sequence’s position in the tree.
If your sequence is closely related to any of the reference or uploaded sequences, it will share a single node with those species.
If your sequence is identical to another sequence, the two will diverge directly from the node without branches.
If your sequence is distantly related to all of the species in your tree, your sequence will sit on a branch by itself – with the other sequences grouping together as a clade.
To identify the smallest clade that includes your sequence, click the node that is directly connected to your sequence. The sequences that are highlighted are the closest relatives of your sequence in the tree.
Look at the scientific names of sequences within the most closely associated clade. If all members share the same genus name, you have identified your sequence as belonging to that genus. If different genus names are represented, check and see if they belong to the same family or order.
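The genus check described above can be expressed as a small test on the scientific names in the highlighted clade. This sketch assumes each name starts with the genus, as in a standard binomial; the species names in the example are only illustrative.

```python
def shared_genus(scientific_names):
    """Return the genus if every name in the clade shares it, else None."""
    genera = {name.split()[0] for name in scientific_names if name.strip()}
    return genera.pop() if len(genera) == 1 else None


print(shared_genus(["Quercus alba", "Quercus rubra", "Quercus robur"]))  # Quercus
print(shared_genus(["Quercus alba", "Fagus sylvatica"]))  # None
```

A result of None is the cue to compare the names at the family or order level instead.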
Return to the menu, and click "PHYLIP NJ" to generate a phylogenetic tree using the neighbor joining method. How does it compare to the maximum likelihood tree? What does this tell you?
If neither tree places your sequence within an identifiable clade – or if that clade is only at order level – you will need to add more sequences that may increase the resolution of your analysis. Return to Step 5, and add more reference sequences or obtain sequences within the order or family clade that contained your sequence. Then repeat Steps 6-7 to select, align, and generate trees from your refined data set.
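Distance-based methods such as neighbor joining start from a table of pairwise distances between aligned sequences. The simplest such measure is the p-distance, the proportion of compared sites that differ. The sketch below computes it under the assumption that the sequences are already aligned; tree-building programs typically apply a substitution model on top of raw differences, so this is an illustration of the idea rather than what PHYLIP calculates.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') or an N are skipped.
    """
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


print(p_distance("ATGCATGC", "ATGCATGA"))  # 0.125
```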
Exporting Sequences to GenBank
If you do not identify any identical hits through searches in DNA Subway, GenBank, and BOLD and you have determined that your sequence is of high quality, you may have a novel sequence.
Once you have identified a potentially novel sequence there are additional steps that you can take, including publishing your sequence to GenBank through DNA Subway. It is not required that a sequence be novel to publish it to GenBank. However, discretion should be used, and sequences that are already present in GenBank multiple times for a particular species or without vetted metadata (definitive species identification, collection information, etc.) should not be published.
Note: Only high quality consensus sequences that have been generated by a submitter, and have not been previously submitted can be exported to GenBank.
Click “Export to GenBank” in the project window.
Click “New submission.” (If you are working with an animal sample, you need to specify whether it is from a vertebrate, invertebrate, or echinoderm.) Then click “Proceed.”
If you have already recorded information about your samples in the DNALC Barcoding Samples Database, enter the sample’s code number. Its information will be retrieved automatically. If not, you can enter the sample information manually in the next step; click “Continue.”
Verify and fill in the information required in the “Specimen info” window; click “Continue”.
Add photos of the sample if you have any available.
Verify your submission information, make any appropriate changes if necessary, and finally click “Submit.” You will receive a notification that your sequence has been submitted to NCBI and a specialist there will check it. If your submission passes NCBI’s verification procedure, you will receive a notification that your sequence has been published in GenBank.
Good, you got 96% similarity with the data available on the NCBI site (was it the GenBank database?). BLAST can give you a phylogenetic tree for your PCR product, but it will be very overcrowded. I would recommend using software such as MEGA4, which is readily available and should give you a more satisfactory phylogenetic tree.
I have a forward and a reverse sequence. I know I need to combine these sequences to see the overlap/final PCR product sequence. What is this called? I'm using GeneStudio.
It depends entirely on you and the objective of your research. You must have had something in mind before performing the sequencing. So, do what you intended to do with the gene sequences you have obtained.