I would like to know how to calculate the entropy of a binary word (I can have words of different sizes: 8, 16, 32, 400 bits). I know about the Shannon entropy, but it is related to a set, not to an individual.
You can calculate the letter-level mean Shannon entropy either independently of the sequence or depending on it. The sequence-independent mean entropy is Sh = SUM[-pi·log2(pi)], where the probability pi of each i-th letter is estimated from the frequency of that letter in the text (genome, message, book, etc.). For the sequence-dependent entropy, or graph entropy (a sequence is a linear graph), you can use a Markov chain approach to calculate the probabilities. Together with Prof. Cristian R Munteanu we have published on this and released the software S2SNet to do both kinds of calculations. Please send me an email if you are further interested in it. See some refs (and a small sketch of the frequency-based calculation after them):
1: Munteanu CR, Magalhães AL, Uriarte E, González-Díaz H. Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices. J Theor Biol. 2009 Mar 21;257(2):303-11.
2: Munteanu CR, González-Díaz H, Borges F, de Magalhães AL. Natural/random protein classification models based on star network topological indices. J Theor Biol. 2008 Oct 21;254(4):775-83.
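A minimal sketch of the sequence-independent, frequency-based entropy described above (plain Python, not the S2SNet implementation; function and variable names are illustrative):

from collections import Counter
from math import log2

def shannon_entropy(sequence):
    # Mean Shannon entropy per letter, Sh = SUM[-pi*log2(pi)],
    # with pi estimated from the letter frequencies in the sequence.
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("11110000"))  # 1.0 bit per letter: equal numbers of 1s and 0s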
If you have a text with words of variable length, then you can treat each word length as a symbol. If we have N words in the text and I distinct word lengths, then we have I symbols, and we can estimate the probability PI of occurrence of every symbol as its frequency divided by N. The entropy contribution of each word length is then -PI·log2(PI).
By summing over all word lengths occurring in the N-word message we get the overall entropy.
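A minimal sketch of this word-length-as-symbol calculation (assuming whitespace-separated words; names are illustrative):

from collections import Counter
from math import log2

def word_length_entropy(text):
    # Treat each word length as a symbol, estimate PI from frequencies,
    # and sum -PI*log2(PI) over all observed lengths.
    lengths = [len(w) for w in text.split()]
    n = len(lengths)
    counts = Counter(lengths)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(word_length_entropy("11110000 1010 11000011 1010"))  # lengths 8,4,8,4 -> 1.0 bit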
Why do you say nothing about probabilities of the code words or conditional (transition) probabilities from word to word, if they are not independent?
If you don’t have these data, you cannot compute any entropy.
The sentence “Shannon's Entropy is related to a set, not to an individual” is unclear. The entropy of a single individual (particular) message does not exist (it is zero). If the source generates only one message (signal, sign, letter or fingerprint...), its uncertainty and entropy are always zero.
If you have the probabilities of the code words pi (i = 1, …, M), the proposal of colleague González (I did not check whom I am speaking with) is correct, and SUM[-pi·log2(pi)] defines the entropy per word (i.e. the entropy of the source of code words). The entropy per symbol is SUM[-pi·log2(pi)] / SUM[Li·pi], where Li is the length of the i-th code word (in your case M = 4). That is all you can compute.
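A minimal sketch of these two quantities, assuming you know the code-word lengths and their probabilities (the probability values below are purely illustrative):

from math import log2

lengths = [8, 16, 32, 400]        # the four word lengths from the question
probs = [0.4, 0.3, 0.2, 0.1]      # assumed probabilities; replace with real data

entropy_per_word = -sum(p * log2(p) for p in probs)
mean_length = sum(L * p for L, p in zip(lengths, probs))
entropy_per_symbol = entropy_per_word / mean_length

print(entropy_per_word, entropy_per_symbol)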
I will read the works suggested by Humbert G. Díaz
Dear Humbert G. Díaz and Abdelhalim abdelnaby Zekry:
The Shannon entropy is defined for a set (a group of elements), not for a single element. Using it as is cannot take into account the internal diversity of the binary word. For example, the binary words 11110000, 10101010 and 11000011 will all have the same Shannon entropy, whereas their internal order is very different.
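A quick check of this claim with bit frequencies alone (all three words contain four 1s and four 0s, so each gives exactly 1 bit per symbol):

from collections import Counter
from math import log2

for word in ("11110000", "10101010", "11000011"):
    counts = Counter(word)
    n = len(word)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    print(word, h)  # 1.0 in every case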
Dear Anatoliy Platonov:
I have binary words that represent the fingerprints of molecules, so each word is not probabilistic; it is just a codification of the molecular structure into a set of 1's and 0's. What I need is to compute the entropy of each molecule, in order to know which one is more disorderly. Of course, I could use some modeling tool that computes the "real" entropy of the molecule (MOPAC, GAMESS, Gaussian, etc.), but this would be very expensive for a large number of systems (and for large systems).
I think that this work could help (suggested by David Quesada in another forum):
https://arxiv.org/pdf/1305.0954.pdf.
In this work, the author defines a BiEntropy and a logarithmically weighted BiEntropy that take into account the internal order/disorder of n-bit strings.
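A sketch of how I read the BiEntropy definitions in that paper (a weighted average of the binary entropies of the string and its successive binary derivatives); please check the exact weights against the paper before relying on it:

from math import log2

def binary_derivative(bits):
    # XOR of each adjacent pair of bits; the result is one bit shorter.
    return [a ^ b for a, b in zip(bits, bits[1:])]

def binary_h(p):
    # Binary Shannon entropy of a proportion p of ones.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bientropy(word, logarithmic=False):
    # Weighted average of binary_h over the string and its derivatives:
    # weights 2**k for BiEntropy, log2(k + 2) for the logarithmic variant.
    bits = [int(b) for b in word]
    total = weight_sum = 0.0
    for k in range(len(word) - 1):
        p = sum(bits) / len(bits)
        w = log2(k + 2) if logarithmic else 2.0 ** k
        total += binary_h(p) * w
        weight_sum += w
        bits = binary_derivative(bits)
    return total / weight_sum

for w in ("11110000", "10101010", "11000011"):
    print(w, round(bientropy(w), 4), round(bientropy(w, logarithmic=True), 4))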
The sequence-dependent Shannon entropy of order k associated with a Markov chain (Shk) that we used is not defined in this way.
In the case of the sequence-based Shk entropies calculated by the S2SNet algorithm, the three sequences 11110000, 10101010 and 11000011 clearly have different entropy values.
In fact, the examples you mention are very similar to SNPs (single nucleotide polymorphisms) or to intra-chromosome gene orientation patterns, for instance. In both cases we have been able to discriminate these kinds of sequences and to use Machine Learning to predict external properties (biological function, etc.) of sequences with the same letter frequencies but different sequence patterns.
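As a generic illustration of how a sequence-dependent (conditional/Markov) entropy separates these strings — this is not the exact Shk definition used in S2SNet, only a first-order conditional entropy sketch:

from collections import Counter
from math import log2

def first_order_conditional_entropy(word):
    # H(X_t | X_{t-1}) estimated from bigram counts:
    # sum over pairs (a, b) of p(a, b) * log2(p(a) / p(a, b)).
    pairs = list(zip(word, word[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        p_a = first_counts[a] / n
        h += p_ab * log2(p_a / p_ab)
    return h

for w in ("11110000", "10101010", "11000011"):
    print(w, round(first_order_conditional_entropy(w), 4))
# prints three different values, unlike the frequency-based entropy above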