Does anyone know of a proof of the fact that, by coding a memoryless information source X, the resulting information source Y (the stream of code symbols) has memory, and that H(X) = L*H(Y), where L is the average code length?
I think the task is not clearly formulated. What exactly is bothering you?
If you know the entropy of the source H(X) [bits/message] (I hope you are considering discrete sources), and H(Y) is the mean number of bits per code symbol, then L = H(X)/H(Y) by definition. No proof is needed.
Indeed, I consider discrete sources. However, "according to the definition", L = sum_i p(x_i)*l(c_i), where p(x_i) is the probability of the message x_i of the source X and l(c_i) is the length of the codeword c_i corresponding to x_i. I need to characterize Y as a source with memory (states, matrix of the transition probabilities between states, steady-state probabilities of the states), to compute H(Y), and to verify the relation L = H(X)/H(Y).
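Here is a minimal sketch of one way such a characterization can be set up numerically (my own construction and example numbers, not taken from any reference): the code stream Y is modelled as a Markov chain whose states are the proper prefixes of the codewords, its steady-state distribution and entropy rate are computed, and the relation L = H(X)/H(Y) is checked.

```python
# A minimal sketch (assumed source probabilities and prefix code, for illustration only):
# model the binary code stream Y as a Markov chain whose states are the proper
# prefixes of the codewords (internal nodes of the code tree), compute its
# steady-state distribution and entropy rate, and check L = H(X)/H(Y).
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])              # memoryless source X
code = ['0', '10', '110', '111']                # a prefix code for X

H_X = -(p * np.log2(p)).sum()                   # entropy of X [bits/message]
L = sum(pi * len(c) for pi, c in zip(p, code))  # average code length [symbols/message]

# States of Y: proper prefixes of codewords ('' marks a codeword boundary).
states = sorted({c[:k] for c in code for k in range(len(c))})
idx = {s: i for i, s in enumerate(states)}

# From state s, code symbol b is emitted with probability
# P(codeword extends s by b | codeword has prefix s); the next state is s+b if
# that is still a proper prefix, otherwise we are back at the boundary ''.
P = np.zeros((len(states), len(states)))        # matrix of transition probabilities
H_cond = np.zeros(len(states))                  # entropy of the next code symbol in each state
for s in states:
    mass = {b: sum(pi for pi, c in zip(p, code) if c.startswith(s + b)) for b in '01'}
    total = sum(mass.values())
    for b, m in mass.items():
        if m == 0:
            continue
        q = m / total
        nxt = s + b if (s + b) in idx else ''
        P[idx[s], idx[nxt]] += q
        H_cond[idx[s]] -= q * np.log2(q)

# Steady-state probabilities of the states, then the entropy rate of Y.
eigval, eigvec = np.linalg.eig(P.T)
pi_stat = np.real(eigvec[:, np.argmax(np.real(eigval))])
pi_stat /= pi_stat.sum()
H_Y = float(pi_stat @ H_cond)                   # bits per code symbol

print(f"H(X) = {H_X:.4f}, L = {L:.4f}, H(Y) = {H_Y:.4f}, L*H(Y) = {L*H_Y:.4f}")
# With these numbers: H(X) ~ 1.8464, L = 1.9, H(Y) ~ 0.9718, and L*H(Y) ~ 1.8464 = H(X).
```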
I think this is a nice question; I was looking for something similar a while ago and don't remember finding a clear reference on this topic. For sure, what you ask can be proved by simply using the A.E.P. (you may also need to consider the typical length in this case for the process Y). However, what I was looking for was a detailed study of the entropy of the first n output symbols. If you start to play around with this, you quickly fall into the same recurrence relations that are also used in the proof of the McMillan theorem, but the nice thing is that they are stated in the logarithmic domain. The most useful reference that I found was in the Csiszar-Korner book (Ch. 4 in the new edition, but you have to go into the details of the proof of Th. 4.1, see eq. (4.8)).
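Roughly, the skeleton of that A.E.P. argument is the following (just a sketch, assuming an i.i.d. source X and a uniquely decodable code; the "typical length" nL is exactly the point that needs care):

```latex
% Sketch of the A.E.P. argument (assumes an i.i.d. source X and a uniquely decodable code).
% There are about 2^{nH(X)} typical source blocks x^n, each of probability about 2^{-nH(X)},
% and by the law of large numbers the code string of a typical x^n has length about nL.
% Unique decodability makes the map x^n -> (code string) injective, so:
\begin{align*}
  H\!\left(Y^{nL}\right) &\approx \log_2 2^{nH(X)} = nH(X), \\
  H(Y) = \lim_{m\to\infty}\frac{H(Y^m)}{m} &\approx \frac{nH(X)}{nL} = \frac{H(X)}{L}.
\end{align*}
```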
What I find interesting is that this topic was somehow already considered by Shannon in his '48 paper, in the context of noiseless channels. His results are extremely general and powerful. For example, as mentioned in the report I cited in my other comment, Shannon's result already implied McMillan's theorem, and even a more general theorem for the case of sources with memory (see the same report: http://www.ing.unibs.it/~marco.dalai/pub/DL_TR_2008-01-51.pdf).
Thank you, Marco, for your useful comments and your report. Concerning your problem, the entropy of the first n output symbols must be H(Y^n) = nH(Y). As L=H(X)/H(Y), the entropy is H(Y^n) = nH(X)/L. Do you agree?
Hi Gheorghe, I don't think this is true for finite n; it only holds asymptotically. In your original question I assumed that with H(Y) you were referring to the entropy rate of the process Y, because it is in general a process with memory (as you say). For finite n, the entropy of the first n output symbols really depends on the code. Trivially, if all the codewords start with the same symbol (this is clearly non-optimal, but still possible), then the entropy of the first code symbol is zero, no matter what the entropy of X is.
Yes, in that case the entropy of the first code symbol is zero. The average length L' of the codewords is increased by 1 symbol/codeword (L' = L+1). Hence, the entropy H'(Y^n) becomes smaller. However, the product L' * H'(Y^n) does not change, since L*H(Y^n) = L'*H'(Y^n) = nH(X).
Marco, you are right, all my relations hold asymptotically, not for the first n output symbols. However, I did compute the entropy H_n(Y) of the first n output symbols (just for an example...) and I obtained (as expected...) that H_n(Y)/n -> H(Y), where H(Y) is the asymptotic entropy rate per code symbol. I think your problem is very interesting. Did you have a practical application in mind for this problem?
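For illustration, here is a small self-contained sketch of one way such a computation can be done (my own example numbers, not the computation referred to above), using a code in which every codeword starts with the same symbol, as in Marco's example: the distribution of the "current prefix" state of the code tree is tracked symbol by symbol, H(Y^n) is accumulated as the sum of the conditional entropies H(Y_k | Y^{k-1}), and the ratio H(Y^n)/n is printed as it approaches H(X)/L.

```python
# A sketch under assumed numbers: H(Y^n)/n for the first n code symbols of a
# prefix-coded memoryless source, with a (deliberately suboptimal) code in which
# every codeword starts with '0', so that H(Y^1) = 0.
import numpy as np

p = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}             # memoryless source X
code = {'a': '00', 'b': '010', 'c': '0110', 'd': '0111'} # prefix code, common first symbol
codewords = set(code.values())
H_X = -sum(q * np.log2(q) for q in p.values())
L = sum(p[s] * len(code[s]) for s in p)                  # here L = 2.9, H(X)/L ~ 0.6367

def h(q):                                                # binary entropy
    return 0.0 if q in (0.0, 1.0) else -q*np.log2(q) - (1-q)*np.log2(1-q)

def step(state_dist):
    """One code symbol: return (H(Y_k | Y^{k-1}), distribution of the next state).
    The state is the current proper prefix of a codeword; '' marks a codeword boundary.
    Since the stream starts at a boundary, the past symbols determine the state, so
    H(Y_k | Y^{k-1}) = sum_s P(S_k = s) * H(next symbol | S_k = s)."""
    H_k, new_dist = 0.0, {}
    for prefix, prob in state_dist.items():
        mass = {b: sum(p[s] for s in p if code[s].startswith(prefix + b)) for b in '01'}
        total = mass['0'] + mass['1']
        H_k += prob * h(mass['0'] / total)
        for b in '01':
            if mass[b] == 0:
                continue
            nxt = '' if (prefix + b) in codewords else prefix + b
            new_dist[nxt] = new_dist.get(nxt, 0.0) + prob * mass[b] / total
    return H_k, new_dist

dist, H_n = {'': 1.0}, 0.0
for n in range(1, 65):
    H_k, dist = step(dist)
    H_n += H_k
    if n in (1, 2, 4, 8, 16, 32, 64):
        print(f"n={n:2d}  H(Y^n)/n = {H_n/n:.4f}   (asymptote H(X)/L = {H_X/L:.4f})")
```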