I calculated the entropy of my dataset, but I want to compare it to the entropy of the MNIST dataset to find an explanation for the results that I obtained.
Yes, this is the usual expression for entropy, H = -Σ_k p_k log p_k, but that leaves you with the question of how you define p_k and which states k you sum over. I can imagine something for a database of letters. Suppose my letter is a 10x10 pixel array (just black and white pixels); then there are 2^100 (~10^30) different pixel arrays, and maybe 10^5 of those are recognizable as an A, so I have my probability of finding a particular letter (it could be a little larger if I take rotations and positions into account, but it will still be very small compared to all possibilities), and I can sum over all letters. But I could also say: I know it is a letter, so what is the probability that it is an A? The two entropies would be completely different (see the sketch below). I assume you did something like this; if not, please explain what you did. Some letters may also be more recognizable than others. But do I actually learn something from that?
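To make that concrete, here is a tiny numerical sketch of those two choices of p_k. The 10x10 binary images, the 26 letters, and the count of ~10^5 recognizable patterns per letter are just the illustrative guesses above, not real data:

```python
import numpy as np

# Thought experiment: 10x10 black/white images, so 2^100 possible patterns,
# and (assumed) ~1e5 of them recognizable as each of the 26 letters.
n_patterns = 2.0 ** 100      # all possible pixel arrays
per_letter = 1e5             # assumed patterns recognizable as a given letter
n_letters = 26

# Choice (a): p_k = P(a random pattern is recognizable as letter k),
# plus one catch-all state "not a letter" so the probabilities sum to 1.
p_letters = np.full(n_letters, per_letter / n_patterns)
p_a = np.append(p_letters, 1.0 - p_letters.sum())
H_a = -np.sum(p_a[p_a > 0] * np.log2(p_a[p_a > 0]))

# Choice (b): p_k = P(letter k | the image is a letter), taken uniform here.
p_b = np.full(n_letters, 1.0 / n_letters)
H_b = -np.sum(p_b * np.log2(p_b))

print(H_a)   # ~2e-22 bits: the letter states carry almost no probability mass
print(H_b)   # log2(26) ~ 4.7 bits
```

The number you get depends entirely on which set of states the p_k range over.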
But how would you do something like that for faces? Does a smiley count as a face? Or do you just want the probability that some arbitrary face is a celebrity? You said you calculated it, so what is the answer, and which p_k did you use? And what exactly do you need an explanation for?
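For concreteness, here is a minimal sketch of one common choice for an image dataset such as MNIST, namely treating the marginal pixel-intensity histogram as the p_k; this is only a guess at what you might have computed, and other choices (per image, per class, binarized pixels) give very different numbers:

```python
import numpy as np
from tensorflow.keras.datasets import mnist  # any MNIST loader would do

# p_k here = relative frequency of each grey level 0..255 over all training pixels.
(x_train, _), _ = mnist.load_data()          # 60000 images of 28x28 uint8 pixels

counts = np.bincount(x_train.ravel(), minlength=256)
p = counts / counts.sum()
p = p[p > 0]                                 # convention: 0 * log 0 = 0
H = -np.sum(p * np.log2(p))                  # Shannon entropy in bits per pixel

print(f"{H:.2f} bits per pixel")
```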
Maybe I am completely on the wrong track, being rather unfamiliar with this field, just curious. I reacted because your question has the word entropy in it. It reminds me of a statement attributed to Shannon about a discussion with von Neumann: "My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons: In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’"
In statistical mechanics we know exactly what is meant by p_k. In information theory I often have no idea, except when it comes to bits.
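For reference, the two textbook forms side by side; in the canonical ensemble the p_k are fixed by the physics, while in information theory the choice of states k is entirely up to the modeller:

```latex
% Gibbs entropy in the canonical ensemble: the p_k are Boltzmann weights.
S = -k_B \sum_k p_k \ln p_k, \qquad
p_k = \frac{e^{-E_k/(k_B T)}}{Z}, \qquad Z = \sum_k e^{-E_k/(k_B T)}

% Shannon entropy: the p_k are whatever distribution you choose to model.
H = -\sum_k p_k \log_2 p_k \quad \text{(in bits)}
```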
During the process of deriving the so-called entropy, ΔQ/T cannot, in fact, be turned into dQ/T. That is, the so-called "entropy" does not exist at all.
The so-called entropy is a concept that was derived by mistake in history.
It is well known that calculus has a definition,
and any theory should follow the principles of calculus; thermodynamics is, of course, no exception, for there is no other calculus at all. This is common sense.
Based on the definition of calculus, we know:
for the definite integral ∫_T f(T)dQ, only when Q = F(T) is ∫_T f(T)dQ = ∫_T f(T)dF(T) meaningful.
As long as Q is not a single-valued function of T, namely Q = F(T, X, …), then
∫_T f(T)dQ = ∫_T f(T)dF(T, X, …) is meaningless.
1) Now, on the one hand, we all know that Q is not a single-valued function of T; this alone is enough to show that the definite integral ∫_T f(T)dQ = ∫_T (1/T)dQ is meaningless.
2) On the other hand, in fact Q = f(P, V, T), so
∫_T (1/T)dQ = ∫_T (1/T)df(T, V, P) = ∫_T dF(T, V, P), which is certainly meaningless.
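As a toy illustration of this kind of path dependence (my own example, with Q = T·X chosen only for simplicity, not as a physical heat function):

```latex
% Let Q = T X, so dQ = X\,dT + T\,dX, and integrate dQ/T from T = 1 to T = 2.

% Path 1: hold X = 1 constant, so dQ = dT.
\int \frac{dQ}{T} = \int_{1}^{2} \frac{dT}{T} = \ln 2 \approx 0.693

% Path 2: let X = T along the way, so Q = T^2 and dQ = 2T\,dT.
\int \frac{dQ}{T} = \int_{1}^{2} \frac{2T\,dT}{T} = 2
```

The two paths share the same limits in T yet give different values; this is what it means for the integral not to be determined by T alone.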
We know that dQ/T is used in the definite integral ∫_T (1/T)dQ, but ∫_T (1/T)dQ is meaningless, so ΔQ/T cannot be turned into dQ/T at all.
That is, the so-called "entropy" does not exist at all.
Why did the wrong "entropy" appear?
In summary, this was due to the following two reasons:
1) Physically, people did not know that Q = f(P, V, T).
2) Mathematically, people did not know that AΔB cannot become AdB directly.
If people had known either of these, the mistake of entropy would not have happened in history.
Please read my paper and the answers to the questions related to it in my Projects.