Suppose we have a dataset with different kinds of attributes (numeric and nominal) and a binary class. How can we compute a single number as the Shannon entropy of this dataset (as an approximation of the Kolmogorov complexity of this dataset)?
@Anastasiia Seredkina: Thanks. The point is: suppose we have a dataset with 10 different attributes (nominal and numeric) and 1000 examples representing the concept. We can calculate the Shannon entropy for each attribute separately, but what about all of them together? Should I use the joint distribution of all attributes as the probability function?
Here are some general considerations when computing information-theoretic measures; I hope some of them are on the way to answering your question.
The Shannon entropy is a measure defined on probability distributions. Different assumptions can be made about the distribution to link your dataset (samples from the distribution) to an estimate of the entropy. If you have some prior knowledge about independence between attributes, it can be used to approximate the joint entropy more efficiently. If you cannot use this kind of prior knowledge, your estimate of the entropy will require many more samples to converge.
Take a simple example: if X and Y are random variables and we assume X and Y are independent, it is possible to estimate H((X,Y)) = H(X) + H(Y) using only the marginal distributions. The entropies of these marginals are much easier to estimate than the entropy of the joint, meaning they require far fewer samples to converge.
More generally, given the numbers you suggest, I would guess that estimating the entropy accurately without prior knowledge is not easy (too few samples): if you use a simple plug-in estimate (non-parametric estimation of the probability density), it will be strongly biased; alternatively, you can reduce the bias of the estimate, but the variance will then increase.
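To make this concrete, here is a minimal sketch in Python (my own illustration, not something from the thread): a simple plug-in histogram estimator applied to two nominal attributes, comparing the sum of the marginal entropies (the independence shortcut) with the entropy of the joint. The function name and the made-up data are purely illustrative.

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) Shannon entropy in bits from a sequence of observed symbols."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Illustrative data: two nominal attributes observed on the same 1000 examples.
rng = np.random.default_rng(0)
x = rng.choice(["a", "b", "c"], size=1000)
y = rng.choice(["u", "v"], size=1000)

h_x = plugin_entropy(x)
h_y = plugin_entropy(y)
h_joint = plugin_entropy(list(zip(x, y)))       # entropy of the joint (X, Y)

print(f"H(X) + H(Y) = {h_x + h_y:.3f} bits")    # cheap: only the marginals are needed
print(f"H(X, Y)     = {h_joint:.3f} bits")      # needs many more samples as attributes are added
```

With 10 attributes the joint histogram has far more cells than 1000 examples can populate, which is exactly where the bias of the plug-in estimate becomes severe.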
Again, these are general comments to start with; if you actually want to discuss in detail the most appropriate entropy estimation techniques for your dataset, please let us know.
Note the pitfalls in information theory, http://schneider.ncifcrf.gov/pitfalls.html - in particular http://schneider.ncifcrf.gov/glossary.html#Shannon_entropy
If you have 10 different attributes, you cannot, from a practical viewpoint, calculate the full joint entropy without making assumptions about the joint distribution, because of the curse of dimensionality. Often a useful assumption can be made; for example, a joint Gaussian is a reasonable approximation in many practical cases. Once you make such an assumption, the entropy is a fairly simple function of the covariance matrix of the system, which is accessible given a sufficiently large sample. Similar comments apply for other assumed joint distributions.
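As a rough sketch of that Gaussian shortcut (an assumption on my part that the numeric attributes are jointly Gaussian; nominal attributes would first need an encoding): the differential entropy of a d-dimensional Gaussian is 0.5 * ln((2*pi*e)^d * det(Sigma)), so a sample covariance matrix is all that is needed. The data below are illustrative only.

```python
import numpy as np

def gaussian_entropy(data):
    """Differential entropy (in nats) of a multivariate Gaussian fitted to `data`
    (rows = examples, columns = numeric attributes): 0.5 * ln((2*pi*e)^d * det(Sigma))."""
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Illustrative example: 1000 samples of 10 numeric attributes.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(10), cov=np.eye(10), size=1000)
print(f"Estimated entropy: {gaussian_entropy(samples):.3f} nats")
```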
The ensemble average over the data points gives the Shannon entropy measure. To express this with an equation: H = -Sum_i p_i log(p_i) over the set of states i.
The entropy estimate should converge monotonically, indicating that the information accrued approaches a specific limit or bound for the training/learning mechanism.
If this is a machine learning application, you may be better off working with the Vapnik–Chervonenkis (VC) dimension. http://en.wikipedia.org/wiki/VC_dimension
Actually, I want to use the Shannon entropy as a criterion to compare different concepts, i.e., to define a notion of the size of a concept that represents the hardness of training a machine on that concept. The idea is to use Kolmogorov complexity, and since it is impossible to compute, I decided to use Shannon entropy as an alternative. So, in this setting, do you have any advice? Do you know of a software package that can calculate it?
Absent any further information, the entropy of the solution space is calculated by assuming the worst-case (uniform) distribution across the state variables in each dimension. This entropy calculation can be done by hand. If you have additional information about the probability distribution of the data (including joint probabilities, as mentioned by others), then drop those probabilities into the Shannon formula rather than assuming a worst-case distribution.
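A small illustration of that calculation (a sketch; the state counts and probabilities are made up): with no distributional information, the uniform distribution gives the worst case H = log2(number of states) per dimension, and with known probabilities you plug them into H = -Sum_i p_i log2(p_i) instead.

```python
import numpy as np

def worst_case_entropy(n_states):
    """Maximum (uniform-distribution) entropy in bits for a variable with n_states states."""
    return np.log2(n_states)

def shannon_entropy(probs):
    """Shannon entropy in bits for a known probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                                  # ignore zero-probability states
    return -np.sum(p * np.log2(p))

# Illustrative: an attribute with 4 states.
print(worst_case_entropy(4))                      # 2.0 bits under the worst-case assumption
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # less than 2 bits once probabilities are known
```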
The ability of the machine learning algorithm to learn rules for predicting the desired output can be similarly calculated based on the number of degrees of freedom (dimensions) of the algorithm. As suggested above, see the VC dimension, which relates the degrees of freedom in a machine learning problem to the complexity of the algorithm.
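For instance, one common statement of the classical VC generalization bound is sketched below (this particular form is my assumption of what would be used here, and the VC dimension h and sample size n values are illustrative): it ties the degrees of freedom of the learner and the number of examples to the gap between training error and true error.

```python
import math

def vc_generalization_gap(h, n, delta=0.05):
    """Classical VC bound on the gap between training and true error:
    sqrt((h * (ln(2n/h) + 1) + ln(4/delta)) / n), holding with probability >= 1 - delta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

# Illustrative: a model with VC dimension 50 trained on 1000 examples.
print(f"Generalization gap bound: {vc_generalization_gap(h=50, n=1000):.3f}")
```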
I think you also need to define 'hard' and 'easy'. A complex machine learning algorithm will overfit the data, so it is no longer learning; it is just throwing so many degrees of freedom at the problem that you can exactly curve-fit rather than generalize (i.e., learn). For the purposes of your question, would you count that as hard or easy? I'd say it's easy. A harder problem is generalizing from a complex landscape of data using only a few degrees of freedom in the machine learning algorithm, all while maintaining a desired error bound on future predictions. But often a simple transformation of the data can make learning easier. So you need to be clear about what you would consider 'hard' or 'easy'. (A different definition of hard and easy might be how many compute cycles or training cycles are needed.)
Let's look at the notions of 'hard' and 'easy' this way: training a machine to determine whether there is a circle in a 1000x1000-pixel grayscale picture is easier than determining whether there is a cat in a 1000x1000-pixel full-color picture of a jungle. It takes less time and less space to build an accurate model for the first case than for the second. What I am trying to do is define a measure of this kind of complexity regardless of the concept description method. (I am not sure yet whether this is possible.)
So if I gave you the Shannon entropy value as requested at the start, what would you do with it? It tells you the randomness in the dataset ... now what? How, mathematically, would you translate that into hard or easy?
Entropy is a measure of randomness; assuming ergodicity, the ensemble average over a series of data observations defines the entropy.
One way to obtain a true measure of a learning mechanism is to compute the entropy over a growing set of observations, which indicates a natural progression in the amount of information accrued, subject to an upper bound.
In the limit, the estimate -Sum_i p_i log(p_i) tends to an upper limit (UL) over the individual datasets and observations, which is the signature of true learning.
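A minimal sketch of that idea (my own illustration, with made-up data): the plug-in entropy estimate computed on a growing prefix of the observations levels off toward an upper bound, here at most log2 of the number of states.

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Plug-in Shannon entropy in bits from a sequence of observed symbols."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Running entropy estimate over growing prefixes of an illustrative observation stream.
rng = np.random.default_rng(0)
stream = rng.choice(["s0", "s1", "s2", "s3"], size=2000).tolist()
for n in (10, 100, 500, 2000):
    print(f"n = {n:5d}: H ~ {plugin_entropy(stream[:n]):.3f} bits (upper bound {np.log2(4):.1f})")
```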
If you are looking for a way to estimate Kolmogorov complexity, you might want to look at compression schemes; an example of their use for causality is explained in the following paper (section 7):
http://arxiv.org/abs/1002.4020
and the original paper by Lempel and Ziv defines a complexity measure for sequences.
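Along those lines, here is a minimal sketch of a compression-based proxy (assuming the dataset can be serialized to bytes, e.g. as CSV text; the toy datasets below are illustrative): the compressed length is an upper bound on the Kolmogorov complexity, so comparing compressed sizes gives a rough, comparable "size of a concept".

```python
import bz2
import zlib

def compression_complexity(data: bytes):
    """Rough upper bounds on Kolmogorov complexity: compressed sizes in bytes."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, level=9)),
        "bz2": len(bz2.compress(data, compresslevel=9)),
    }

# Illustrative: a highly repetitive (low-complexity) dataset vs. a more varied one,
# both serialized as text before compression.
simple = ("1,a,yes\n" * 1000).encode()
varied = "".join(f"{i % 7},{chr(97 + i % 26)},{i % 2}\n" for i in range(1000)).encode()
print(compression_complexity(simple))
print(compression_complexity(varied))
```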