Good day! I have a list of ~10,000 unique DNA sequences, each about 10-20 bp long.
I want to find out whether they could have evolved from one or several ancestral sequences, or whether they emerged independently.
Some of the sequences share similar motifs and can be aligned, while others have nothing in common, so I can't just perform an MSA and build a tree: the distance matrix contains many NAs.
I've tried principal component analysis on k-mer frequencies (k = 1-4), but it tells me nothing: the points form one dense cloud, and PC1 explains only ~4% of the variance.
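To make the feature construction concrete, here is a minimal Python sketch of k-mer frequency vectors for k = 1-4. It is an illustration only, not the exact code used for the analysis: the function name `kmer_frequencies` and the example sequence are hypothetical, and frequencies are normalised separately for each k, which is an assumption.

```python
from itertools import product


def kmer_frequencies(seq, k_values=(1, 2, 3, 4), alphabet="ACGT"):
    """Return relative k-mer frequencies for one sequence in a fixed feature order."""
    seq = seq.upper()
    features = []
    for k in k_values:
        # Enumerate all possible k-mers so every sequence gets the same columns.
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with ambiguous bases such as N
                counts[kmer] += 1
        total = sum(counts.values())
        # Normalise within each k (assumption) so the values are frequencies.
        features.extend(counts[km] / total if total else 0.0 for km in kmers)
    return features


example = kmer_frequencies("ACGTACGTAC")  # made-up 10 bp sequence
n_features = len(example)                 # 4 + 16 + 64 + 256 columns
nonzero_4mers = sum(1 for v in example[84:] if v > 0)
print(n_features, nonzero_4mers)
```

A 10-20 bp sequence contains at most 17 overlapping 4-mers, so nearly all of the 256 4-mer columns are zero for any single sequence. That sparsity may be part of why the PCA shows one dense cloud.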
I also found that the universalmotif R package can perform a similar analysis using motif_comparison(), so I converted each sequence into its own sequence motif. But when I tried it on a small subset of the data, the algorithm behaved very strangely on a list of motifs that were each created from a single sequence. Different methods give the same result (I've added the tree to the question): dissimilar sequences are placed close together, instead of sequences that are in some way similar...
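For clarity, this is roughly what a motif built from a single sequence looks like, as a minimal Python sketch. It does not reproduce universalmotif's own code: `one_hot_ppm` is a hypothetical helper and the sequence is made up. Every column of the position probability matrix has probability 1 for one base and 0 for the others.

```python
def one_hot_ppm(seq, alphabet="ACGT"):
    """Build a position probability matrix from a single sequence.

    Rows are bases and columns are positions. With only one sequence,
    each column is one-hot: the observed base gets 1.0, all others 0.0.
    """
    seq = seq.upper()
    return {base: [1.0 if s == base else 0.0 for s in seq] for base in alphabet}


ppm = one_hot_ppm("ACGTTG")  # made-up 6 bp sequence
for base, row in ppm.items():
    print(base, row)

# Each column sums to 1, as in any position probability matrix.
column_sums = [sum(ppm[b][i] for b in "ACGT") for i in range(6)]
```

Because such columns carry no uncertainty, any column-wise comparison of two of these motifs reduces to whether the bases at aligned positions are identical or not.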