I am going to use the PhyloGibbs program to study potential TF binding sites in 7 orthologous promoter regions taken from the sequenced genomes of human, chimpanzee, rhesus monkey, rat, mouse, cattle and horse. To run this program I need a phylogenetic tree in Newick format. The guide on the program's site says the following:

“string specifies the phylogenetic tree of the species from which the input sequences derive and must be complete and correct. The tree format that we use is the standard Newick format. However, the values that are normally branch lengths are instead "proximities" in our format. A proximity gives the probability that no mutation has taken place along the branch. The name labelling each leaf of the tree should correspond to a substring that occurs in all fasta-headers of those input sequences that derive from this species (and that does not occur in the fasta-headers of sequences from other species).

For example, the following phylogeny

                      0.85
                    -------- human
              0.6  |
         ----------+
        |          |  0.9
        |           -------- chimp
Ancestor +
        |             0.8
        |           -------- mouse
        |     0.7  |
         ----------+
                   |  0.9
                    -------- rat

would be written as

-L "((human:0.85,chimp:0.9):0.6,(mouse:0.8,rat:0.9):0.7)"

(the string is quoted to protect the parentheses from the shell). Note that the proximities are between successive nodes or leaves: eg, 0.9 for chimp indicates that, for a given neutrally evolving base in chimp, the probability of no substitutions in this base since the common ancestor of chimp and human is 0.9. Proximities are multiplicative: for example, if the set of aligned sequences in the input does not actually contain the sequence for human, the human-chimp node will be eliminated and chimp's proximity to the common ancestor of all species in the tree will be 0.6x0.9 = 0.54. Thus, a single tree may be used that includes species not occurring in the input data."
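To check my understanding of the multiplicative rule in the quoted guide, here is a minimal Python sketch that reads a proximity tree string and multiplies the values along the path from the root to each leaf. The function name `leaf_proximities` is my own and is not part of PhyloGibbs; the sketch assumes a well-formed string with no labels on internal nodes.

```python
def leaf_proximities(newick):
    """Return {leaf name: product of proximities from the root to that leaf}.

    Hypothetical helper, not part of PhyloGibbs. Assumes a well-formed
    Newick string whose internal nodes carry no labels.
    """
    s = newick.strip().rstrip(";")
    pos = 0
    result = {}

    def parse_node():
        nonlocal pos
        if s[pos] == "(":
            # Internal node: collect the leaves of every child clade.
            leaves = []
            pos += 1  # skip "("
            while True:
                leaves += parse_node()
                if s[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # skip ")"
                break
        else:
            # Leaf: read the species name.
            start = pos
            while pos < len(s) and s[pos] not in ",():":
                pos += 1
            name = s[start:pos]
            result[name] = 1.0
            leaves = [name]
        # Optional ":proximity" on the branch above this node.
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            q = float(s[start:pos])
            for leaf in leaves:
                result[leaf] *= q
        return leaves

    parse_node()
    return result


tree = "((human:0.85,chimp:0.9):0.6,(mouse:0.8,rat:0.9):0.7)"
for species, q in leaf_proximities(tree).items():
    print(species, round(q, 4))
```

For the example tree from the guide, the value computed for chimp is the 0.6 x 0.9 product that the guide describes for the case where human is absent from the input.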

I am new to phylogenetic bioinformatics and would like to ask: where can I find the “proximities” they mention? I could probably calculate them from evolutionary distances, couldn't I? Where can I find those distances? The genomes are known, so this data must exist, but I don't know where to search. Sorry for bothering you, but I am a total beginner in this field. Can you help me and tell me what this Newick-format tree for my genomes should look like?
