H is a function of the probabilities, and only of the probabilities, so the error in H must be a function of the errors in the estimates of the probabilities and of nothing else. If the probabilities are known exactly, there can be no error in H, the sum of the N terms p_i*ln(p_i). If the error in each term p_i*ln(p_i) is E (and the errors are independent), then the absolute error in the sum scales like E*sqrt(N), while the error relative to the number of terms shrinks like E/sqrt(N).
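Here is a minimal Monte Carlo sketch of that error-propagation claim, assuming the per-term errors are independent and zero-mean with standard deviation E (the numbers N and E below are just illustrative):

```python
import numpy as np

# Check: if each of the N terms p_i*ln(p_i) carries an independent error of
# standard deviation E, the standard deviation of their sum grows like E*sqrt(N).
rng = np.random.default_rng(1)
N, E = 100, 0.01
term_errors = rng.normal(0.0, E, size=(50_000, N))  # hypothetical per-term errors
sum_errors = term_errors.sum(axis=1)                # resulting error in the sum
print(sum_errors.std(), E * np.sqrt(N))             # both are close to 0.1
```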
Estimation of the Shannon entropy from a sample is an active research area. The plug-in estimator simply computes the Shannon entropy of the empirical distribution and uses it as an estimate of the Shannon entropy of the distribution generating the observations. Since x*ln(x) is a strictly convex function, the plug-in estimator is biased, and the size of the bias is often more of a problem than the variance of the estimator. Various less biased estimators have therefore been proposed; a small numerical sketch follows the references below. For details see, for instance:
Zhang, Z.; Grabchak, M. Bias Adjustment for a Nonparametric Entropy Estimator. Entropy 2013, 15, 1999-2011.
Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax Estimation of Functionals of Discrete Distributions. IEEE Transactions on Information Theory 2015, 61(5), 2835-2885.
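As a rough illustration of the bias issue (not of the estimators in the references above), here is a sketch of the plug-in estimator next to the classic Miller-Madow correction, which simply adds (K-1)/(2n) for K observed categories and sample size n; the uniform-distribution test case is an assumption for the demonstration:

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum-likelihood) estimate of Shannon entropy in nats:
    the entropy of the empirical distribution."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(counts):
    """Plug-in estimate plus the Miller-Madow bias correction (K-1)/(2n),
    where K is the number of observed categories and n the sample size.
    This is only one simple bias-adjusted variant; see the references
    above for more refined estimators."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    k = np.count_nonzero(counts)
    return plugin_entropy(counts) + (k - 1) / (2.0 * n)

# The plug-in estimator underestimates the true entropy on average;
# the correction reduces (but does not remove) that bias.
rng = np.random.default_rng(0)
true_p = np.full(20, 1.0 / 20)                      # uniform over 20 symbols
true_H = -np.sum(true_p * np.log(true_p))           # ln(20), about 3.0 nats
samples = rng.multinomial(50, true_p, size=2000)    # 2000 samples of size 50
plug = np.array([plugin_entropy(c) for c in samples])
mm = np.array([miller_madow_entropy(c) for c in samples])
print(f"true H = {true_H:.3f}")
print(f"plug-in mean = {plug.mean():.3f} (bias {plug.mean() - true_H:+.3f})")
print(f"Miller-Madow mean = {mm.mean():.3f} (bias {mm.mean() - true_H:+.3f})")
```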
A big advantage of discriminating-information measurement over entropic information measurement is that errors can easily be recalibrated out of the problem. If you compute the instantaneous discriminating information (the LLR) based on a hypothesized density pair, and later decide that the density pair you used does not accurately represent the true probabilistic description of the underlying random variable, then all you have to do is compute the LLR of your original LLR statistic (which is no longer a true measure of discriminating information); however much the value decreases is a direct measure of how much the error is hurting you in terms of discriminating information content. This works because the LLR possesses the critical property of self-scaling: for an LLR that is properly computed (i.e., where the correct densities were used in the first place), the information content computed from the LLR, considered as a random variable, is guaranteed to return precisely the original LLR value at each and every value it might take on. (To carry this out, compute the two densities of your LLR statistic using the standard rules of random-variable transformations, and then compute the LLR of these two densities.)
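A minimal sketch of the self-scaling property, assuming a simple Gaussian density pair purely for illustration (the pair, the mean shift mu, and the function names are choices for this example, not part of the original argument):

```python
import numpy as np
from scipy.stats import norm

# Hypothesized density pair for the observation x:
#   H0: x ~ N(0, 1),   H1: x ~ N(mu, 1)
mu = 1.5

def llr(x):
    """Instantaneous discriminating information (log-likelihood ratio) of x."""
    return norm.logpdf(x, loc=mu, scale=1.0) - norm.logpdf(x, loc=0.0, scale=1.0)

# The LLR statistic L = llr(x) = mu*x - mu**2/2 is itself a random variable.
# By the standard change-of-variables rules its densities under the two
# hypotheses are:
#   H0: L ~ N(-mu**2/2, mu**2),   H1: L ~ N(+mu**2/2, mu**2)
def llr_of_llr(l):
    """LLR of the LLR statistic, computed from the induced densities of L."""
    return (norm.logpdf(l, loc=+mu**2 / 2, scale=mu)
            - norm.logpdf(l, loc=-mu**2 / 2, scale=mu))

# Self-scaling check: for a correctly computed LLR, llr_of_llr(llr(x)) == llr(x).
x = np.linspace(-3, 3, 7)
print(np.allclose(llr_of_llr(llr(x)), llr(x)))   # True
```

If the original statistic had instead been built from a wrong density pair, repeating the same construction (densities of that statistic under the true pair, then their LLR) would return values smaller in magnitude than a properly computed LLR, and that drop is the loss in discriminating information described above.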
Thank you very much for answering my question about the errors in computing information entropy. Special thanks to Peter Harremoës for sending me links to the articles.