Could someone clarify scepticism regarding bootstrap values?

Sven Le Moine Bauer @Sven-Le-Moine-Bauer

18 November 2015

Hi everybody,

after building some very simple trees based on 16S rRNA for a publication and reading about bootstrapping, I am getting a bit confused about the proper interpretation of bootstrap values.

Bootstrapping creates a set of trees from resampled alignments, each built by randomly selecting columns from the original alignment. This basically answers the question: is my tree supported by the whole alignment or only by some part of it? So why is that linked to the question: can I trust my tree?

Many reviewers will complain about bootstrap values under 70. Why? If the value is low, it does not mean that the tree is wrong, simply that only part of the alignment shapes the tree. No? What is the problem with that?

On top of this, as far as I understand, bootstrapping completely ignores the weight put on different parts of the sequences (e.g., gamma corrections) and neglects evolutionary models.
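To make the resampling step concrete, here is a minimal sketch of how one bootstrap replicate could be drawn. The toy alignment and the function name `bootstrap_alignment` are made up for illustration, and the sketch assumes the standard form of the bootstrap, in which columns are drawn with replacement so that the replicate has the same length as the original.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment (list of equal-length strings)."""
    n_cols = len(alignment[0])
    # Draw as many column indices as the alignment has columns, with replacement,
    # so some columns appear several times and others not at all.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same sampled columns are used for every sequence, so columns stay intact.
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Hypothetical toy alignment, not the one from the post.
alignment = ["ACGTAC", "ACGTTC", "ACGAAC", "TCGAAC"]
rng = random.Random(0)
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

A tree is then built from each replicate with the same method used for the original alignment.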

In this example of 5 short sequences:

in the second column, you can trust that sequence 5 is different from the 4 others, and that an A mutated to a T.

However, in the first column, where there is obviously a high rate of mutation, it is impossible to say whether sequences 1 and 3 are the same because they are phylogenetically close or whether, just by luck, both ended up with an A after several mutations.

This has an influence on the tree, and I don't think bootstrapping cares about it.

To finish, if you look at the few bp before these sequences: if they are conserved, this should make you think that 1 and 3 are indeed the same, as we are obviously in a conserved area. If they are completely different, then it is more likely that the 2 A's are the result of several random mutations. If you randomly pick different columns (bootstrap), I think you lose the information that surrounds the column.

So, can somebody help me with this? I am really confused...

Thanks.

Shiraz A Shah

Hi Sven,

These are good questions. Despite having worked with bioinformatics for more than a decade, I have to admit even I have trouble grasping the concept fully. I'll try my best, but anybody here should feel free to correct me.

Here's how I understand it. The phylogenetic tree that you get is always an estimation and a compromise of some sort. What do I mean by that? Well, given almost any alignment, it's practically impossible to compute a perfect tree. That's basically because we're only seeing the end products of evolution, whereas the tree is trying to guess how that evolution might have occurred, without any information about intermediate steps, time, etc.

This basically means that some parts of the tree will be good, reflecting the trends found in the data, while other parts of the tree are a result of the compromises which the tree building algorithm had to make. Bootstrapping is a way of visualising which parts of the tree are strongly supported by the data, and which parts are just the algorithm trying to make ends meet. If you omit parts of the data, the well supported parts of the tree will remain, whereas the parts of the tree that the algorithm was struggling with will start moving around. Bootstrapping tries to quantify that phenomenon.
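The idea of "parts of the tree that remain" can be sketched in a few lines. The example below is only an illustration under stated assumptions: the four toy sequences are invented, and the tree-building step is replaced by a very simple distance rule for four taxa (pick the pairing of taxa with the smallest summed p-distance), which stands in for a real method such as neighbour-joining or maximum likelihood. The support value is the percentage of bootstrap replicates that recover the same grouping as the full alignment.

```python
import random

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def best_split(seqs):
    """For exactly four taxa, return the pairing {{a, b}, {c, d}} whose
    summed within-pair distance is smallest (ties go to the first pairing tried)."""
    names = sorted(seqs)
    first = names[0]
    best = None
    for partner in names[1:]:
        pair1 = frozenset({first, partner})
        pair2 = frozenset(names) - pair1
        a, b = sorted(pair1)
        c, d = sorted(pair2)
        score = p_distance(seqs[a], seqs[b]) + p_distance(seqs[c], seqs[d])
        if best is None or score < best[0]:
            best = (score, frozenset({pair1, pair2}))
    return best[1]

def bootstrap_support(seqs, n_reps, seed):
    """Percentage of bootstrap replicates that recover the full-data split."""
    rng = random.Random(seed)
    n_cols = len(next(iter(seqs.values())))
    reference = best_split(seqs)
    hits = 0
    for _ in range(n_reps):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        rep = {name: "".join(s[c] for c in cols) for name, s in seqs.items()}
        if best_split(rep) == reference:
            hits += 1
    return reference, 100.0 * hits / n_reps

# Hypothetical toy data: five columns group s1 with s2, one column conflicts.
seqs = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAAAT",
    "s3": "TTTTTAAAAA",
    "s4": "TTTTTAAAAT",
}
reference, support = bootstrap_support(seqs, n_reps=200, seed=1)
print(sorted(sorted(pair) for pair in reference), support)
```

If only one or two columns supported the grouping, many replicates would miss them by chance and the support value would drop, which is the "moving around" described above.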

As for your point about bootstrapping not caring about meaningful vs less meaningful parts of the alignment, that's not how I understand it. Bootstrapping is more about parts of the tree, and less about parts of the alignment. Every part of the alignment is telling its own part of the story about how the sequences evolved and that's why they are all pretty important. Conserved columns will resolve the deeper parts of the tree, while less conserved columns will sort out the shallower parts.

As for some columns being so variable that they end up mutating back, this is already taken into account by modern algorithms. Older algorithms do suffer from a phenomenon called "long branch attraction" because they mistake residues that have mutated back for conserved ones. However, many older algorithms, like neighbour-joining, are still widely used because they are computationally less expensive than, say, maximum likelihood.
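As one illustration of how a substitution model can account for repeated and back mutations, here is a sketch of the classic Jukes-Cantor distance correction. This model is not named in the thread and is chosen here only because it is the simplest case; it assumes equal base frequencies and equal rates for all substitutions. It converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, which is larger than p because some changes are hidden by later changes at the same site.

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from the observed proportion of
    differing sites p, under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# The gap between observed and corrected distance grows with divergence.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor_distance(p), 3))
```

Bootstrapping does not bypass such corrections: each resampled alignment is normally analysed with the same model as the original one.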

Hope this helps clarify things. I'm curious about others' comments on this as well, since it's certainly an issue which a lot of people, including myself, have trouble figuring out.

Andrii Tarieiev

Dear Sven,

It is a complex question. Each phylogenetic tree you build is only a probabilistic model. It can be more or less likely, but nobody knows the exact course of evolution. Bootstrapping is used to validate the tree topology. The general idea: a higher bootstrap value indicates a higher chance that the tree topology reflects the real evolutionary process of the studied gene. But you need to keep in mind that it is only a chance, because traditional phylogenetic analysis has some important limitations. One of them is the back-mutation you mention in your post. Such events are extremely hard to identify, and in most cases we are unable to detect any trace of the changes that preceded a back-mutation. Another one is how to count indels: as one evolutionary event or several.

The answers to many such difficulties depend on how much data and knowledge we have about each particular organism and the sequence used. Some additional analysis techniques have been developed to minimize the impact of such factors, but their usefulness still depends on the case.

As for the question of why 70%: 16S rDNA is a conserved sequence because of its very important function in protein synthesis. This sequence is under strong selective pressure. So even a few changes in it indicate significant evolutionary distance, and a low bootstrap value indicates that the constructed tree is most likely not realistic and probably does not reflect the actual course of evolution.

Sven Le Moine Bauer

Hi everybody,

sorry for the delay, and thanks to all of you for your answers. They help me understand bootstrapping, and trees in general, better. However, your 3 answers confirm one point for me: why on earth is everybody so eager to publish the values and so specific about them if in the end their interpretation is so complicated?

@Artur: You are right, the example was oversimplified, but with more sequences you could be "quite sure".

@Shiraz: Nice to hear from you! Hope everything goes well for you :)

Shiraz A Shah

As Andrii and Artur have also pointed out, a phylogenetic tree will often depict noise rather than evolution, and the bootstrap values reveal that.

Phylogenetic trees are just really hard to compute reliably, and this means that tree building algorithms will output a tree, no matter if the underlying data supports one or not. That's just an inherent issue of the tree problem. Bootstrapping happens to be useful for gauging the extent of that issue, although it's by no means a very elegant solution.

Nevertheless, phylogenetic trees are extremely useful, and thus we have no choice but to live with such shortcomings. Sequence alignment followed by tree construction actually revolutionised biology about half a century ago, and is currently in the process of revolutionising linguistics in the same way:

http://www.ncbi.nlm.nih.gov/pubmed/26403857

So, for the moment, we have to live with bootstrap values, and we'll do ourselves a favour trying to understand what they're all about.
