This equation is related to my earlier examination of shustrings. We can say that for a maximally disordered genome of length four, each of the nucleotides should be found once; all the shustrings are of length one. If the genome is of length 16, then all the shustrings are of length two; remember, order of symbols in a shustring matters. So the chart is:
...seq len...........tile len..........4^seq len..........number of tilings
Also, regarding my earlier described algorithm for finding shustrings, it happens that the algorithm works just like the algorithm for finding the suffix tree. What is different is that a suffix tree is constructed completely out to the end leaf of each path. My algorithm stops just a soon as a shustring is identified. This is to say, my algorithm works by constructing just enough of the suffix tree so as to identify the shustrings but, my algorithm generally does not construct all of the suffix tree. Rather, the constructed suffix tree is in general severely pruned of all its leaves. The algorithm of Haubold, et al. does too much work in identifying shustrings; more efficient algorithms exist. I suggest my algorithm is one such more efficient algorithm, as compared to the Haubold algorithm.
I am not sure. The log of this quantity is supposed to be the entropy.
If " ^" denotes exponentiation then the last factor can be dropped because it evaluates to 1. Further, the first and the third terms can be combined. Is this correct? Good luck.
First, the shortest sequence in which each nucleotide appears once.
This sequence is four nucleotide residues long, and if we consider it to be a necklace [see: http://en.wikipedia.org/wiki/Necklace_%28combinatorics%29, and look under the section Examples to see the relation (n - 1)!], then we see that for A C G T, there are only (4 - 1) ! or 3 * 2 * 1 possible arrangements; save for reversals, there is no duplication in these. This is shown in the above table.
If we allow AAAA and TGGT, etc, then we allow for 4 ^ 4 possible arrangements, but these do not represent the greatest sequence entropy. The greatest sequence entropy occurs in only 6 forms, though I like to remove the reversals, since the molecule floats in 3D.
Second, consider the case where the tiles are all of length two. Thus, the shortest sequence that can represent all 16 of these pairings is itself of length 16.
For instance, AA AC AG AT ... TG TT can all be represented as:
AACC AGGT CTGC GATT
It is easy to see that all the shustrings are of length two. For a sixteen symbol long string made of a symbol set of size four, the total number of arrangements is 4 ^ 16 but there are only 82,944 of these sequences that have all the shustrings of length two; every other sequence has shustrings of length strictly greater than two.
Hence, only when the shustrings are all short and of identical size is entropy of the sequence maximised.
I think it should also be considered that increases in shustring length (beyond the minimal sizes listed in the above table) represent redundancies, and so represent less entropy as compared with sequences that comply with the table.
Another way to approach this issue begins with the observation that there is no distinction (especially from a local viewing) between one part of a mono-nucleotide residue polymer and another part of that same molecule, with the problem becoming greater in direct relation with increasing length of the polymer.
This is not the case when one considers a sequence of optimal shustrings. Generally, even local viewings between two arbitrary parts of such a molecule will be strikingly distinct,.
Let's give the example for tiles of length three. The tiles are:
AAA AAC AAG AAT
ACA ACC ACG ACT
AGA AGC AGG AGT
ATA ATC ATG ATT
CAA CAC CAG CAT
CCA CCC CCG CCT
CGA CGC CGG CGT
CTA CTC CTG CTT
GAA GAC GAG GAT
GCA GCC GCG GCT
GGA GGC GGG GGT
GTA GTC GTG GTT
TAA TAC TAG TAT
TCA TCC TCG TCT
TGA TGC TGG TGT
TTA TTC TTG TTT
It is fairly easy to see how the equation is derived.
Start with any tile; we elect to use AAA.
Now, I can add only one nucleotide at a time to the sequence, and
so the next tile will be AAC or AAG or AAT. This is to say, I have only
three choices for the tile I may use; I elect to use AAC. Thus, the sequence starts as:
?..AAA
3.....AAC
For the next tile, I see that ACA, ACC, ACG and ACT are available; I have four choices. I elect to use ACT. Thus, the sequence now looks like:
?..AAA
3.....AAC
4........ACT
One can already see the pattern. For each leading two symbol pair of a tile, it happens that at one time I will have four choices, then latter I will have three choices, still latter having just two choices and finally late in the game there will be a case of only one choice. It is not hard to see how the exponents on these choices is derived, as well; left for a reader exercise.
I think the maximal disorder of sequences produced by this method, as compared to the disorder of all other (in this case) 64-mers, is obvious but, it seems not to be discussed anywhere.
You appear to be addressing your own question via multiple posts. Might you indicate where this equation, simplified as f(n) = (4!)^(4^n)/4, comes from, e.g. with a citation?
I derived this equation, as hinted in the post (next to last paragraph) just above.
What is described is a view, an approach to understanding disorder in a sequence. My modest mathematical abilities make simplification of the equation difficult, and so I thank you for producing same.
Part of the point of the post is to find any available citation; I know of none.
So, these posts are in part designed to elicit contact with others who have interest in this topic.