How are sequence similarity clusters defined in the Protein Data Bank?

25 October 2020

Hi,

I am new to the field of structural biology and am trying to understand how sequence similarity clusters are defined in the Protein Data Bank. As a non-scientist, I would be grateful for a high-level answer.

Specifically, my question is: what is the definition of a sequence similarity cluster and does the definition stay the same over time?

a.) In other words, once put into a sequence similarity cluster, does the protein chain always stay in that cluster? That is, are the same protein chains always grouped together in a cluster (although new chains may get added to the cluster over time as the Protein Data Bank grows)?

b.) Or, is it the case that protein chains get grouped with different chains over time as the Protein Data Bank grows?

c.) How are new sequence similarity clusters born?

Thank you!

Annemarie Honegger

Since no one has supplied any answer yet, this is a first step:

"Pre-calculated protein structure alignments at the RCSB PDB website"

Andreas Prlić, Spencer Bliven, Peter W. Rose, Wolfgang F. Bluhm, Chris Bizon, Adam Godzik, Philip E. Bourne

Bioinformatics, Volume 26, Issue 23, 1 December 2010, Pages 2983–2985, https://doi.org/10.1093/bioinformatics/btq572

http://europepmc.org/article/med/20937596

SUMMARY: With the continuous growth of the RCSB Protein Data Bank (PDB), providing an up-to-date systematic structure comparison of all protein structures poses an ever growing challenge. Here, we present a comparison tool for calculating both 1D protein sequence and 3D protein structure alignments. This tool supports various applications at the RCSB PDB website. First, a structure alignment web service calculates pairwise alignments. Second, a stand-alone application runs alignments locally and visualizes the results. Third, pre-calculated 3D structure comparisons for the whole PDB are provided and updated on a weekly basis. These three applications allow users to discover novel relationships between proteins available either at the RCSB PDB or provided by the user.

AVAILABILITY AND IMPLEMENTATION: A web user interface is available at http://www.rcsb.org/pdb/workbench/workbench.do. The source code is available under the LGPL license from http://www.biojava.org. A source bundle, prepared for local execution, is available from http://source.rcsb.org.

Annemarie Honegger

More precisely, the sequence similarity clusters are computed with BLASTClust (https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html, http://nebc.nox.ac.uk/bioinformatics/docs/blastclust.html). The procedure is described as follows:

"BLASTClust (Altschul et al., 2004) is used to cluster all protein chains by sequence similarity. We require 90% overlap between all sequences in a cluster. Therefore, a shorter fragment (e.g. a single domain) of a longer sequence (e.g. a multi-domain protein) will usually not be in the same cluster as the whole sequence. Within clusters, sequences are ranked by experimental method, resolution and release date."
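
To make the 90% overlap rule in the quote more concrete, here is a small Python sketch. It is only an illustration, not the RCSB PDB pipeline or BLASTClust itself: the chain names, lengths and aligned lengths are made up, the sequence identity threshold is ignored, and I am assuming that chains are grouped by single linkage (any two chains that pass the overlap test end up in the same cluster). The functions coverage_ok and cluster_chains are hypothetical helpers written for this example.

```python
def coverage_ok(aligned_len, len_a, len_b, min_coverage=0.9):
    """True if the aligned region covers at least min_coverage of BOTH chains."""
    return (aligned_len / len_a >= min_coverage
            and aligned_len / len_b >= min_coverage)


def cluster_chains(lengths, alignments, min_coverage=0.9):
    """Toy single-linkage clustering (an assumption, see text above).

    lengths:    dict mapping chain id -> sequence length
    alignments: list of (chain_a, chain_b, aligned_length) tuples
    """
    parent = {chain: chain for chain in lengths}

    def find(chain):
        # Follow parent links to the cluster representative.
        while parent[chain] != chain:
            parent[chain] = parent[parent[chain]]
            chain = parent[chain]
        return chain

    for a, b, aligned_len in alignments:
        if coverage_ok(aligned_len, lengths[a], lengths[b], min_coverage):
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for chain in lengths:
        groups.setdefault(find(chain), []).append(chain)
    return sorted(sorted(group) for group in groups.values())


# Made-up example: two full-length chains and one single-domain fragment.
lengths = {"full_A": 300, "full_B": 295, "domain_C": 120}
alignments = [
    ("full_A", "full_B", 290),    # covers more than 90% of both chains
    ("full_A", "domain_C", 120),  # covers all of the fragment, 40% of full_A
]
result = cluster_chains(lengths, alignments)
print(result)
```

In this toy data the single-domain fragment stays in its own cluster, because the alignment covers only 40% of the full-length chain. Under the single-linkage assumption, a newly deposited chain that passes the overlap test against members of two existing clusters would merge them, so the grouping of chains is not necessarily fixed as the archive grows.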
