Hi everyone,

I am trying to build a phylogenetic tree from a few hundred 16S sequences of different lengths. About half of my dataset is partial sequences of around 700 bp; the rest are full-length sequences (around 1500 bp).

Obviously, when I align these sequences, a lot of positions are going to consist of gaps for half of my sequences.

I don't know which will hurt the quality of the final tree more: leaving these gaps in, which don't really provide information and can lead to "wrong" clusters, or removing these positions, which amounts to discarding information from the sequences that did have a base there.

I guess another way to ask this question is: is it a bad idea to try to make a tree with sequences of different lengths? Is there a way around these technical issues?
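To make the "removing these positions" option concrete, here is a minimal sketch in plain Python of what I mean by dropping gap-heavy columns. The function names, the toy alignment and the 0.4 threshold are all illustrative assumptions on my part, not taken from any particular tool, and a real alignment would of course be read from a file rather than typed in.

```python
def gap_fraction_per_column(alignment):
    """Return, for each alignment column, the fraction of sequences with a gap ('-')."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [
        sum(seq[i] == "-" for seq in alignment) / n_seqs
        for i in range(n_cols)
    ]


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is <= max_gap_fraction (hypothetical cutoff)."""
    fractions = gap_fraction_per_column(alignment)
    keep = [i for i, frac in enumerate(fractions) if frac <= max_gap_fraction]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment: two "full-length" and two "partial" sequences.
toy = [
    "ACGTACGT",
    "ACGTACGA",
    "ACGT----",
    "ACGA----",
]

# With a cutoff of 0.4, the last four columns (50% gaps) are removed,
# so the full-length sequences lose the bases they had there.
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.4)
for seq in trimmed:
    print(seq)
```

With a cutoff of 0.5 instead, every column is kept, which is exactly the "leaving these gaps" option, so the choice of threshold is really the question I am asking.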

Thanks a lot for your help,

Marine
