In many cases we approximate the histogram of a data set as the probability density function. Can we quantify the difference between the PDF and the histogram? Is there an analytic expression that describes this difference? Thanks.
As the number of samples from which one constructs the histogram approaches infinity, the normalised histogram (in which the frequency in each interval is divided by the total number of samples) approaches the probability density function. For a continuous distribution it is a stepwise approximation: the normalised frequency in an interval approaches the integral of the density function over that interval.
It can be considered a special case of stochastic convergence.
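To make that convergence concrete, here is a minimal sketch (the standard normal distribution, the bin grid, and the sample sizes are illustrative choices of mine, not part of the answer above) comparing the normalised bin frequencies with the integral of the pdf over each bin as the sample grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    counts, edges = np.histogram(x, bins=30, range=(-4.0, 4.0))
    normalised = counts / n                                            # frequency per interval / sample size
    bin_prob = stats.norm.cdf(edges[1:]) - stats.norm.cdf(edges[:-1])  # integral of the pdf over each interval
    print(n, np.max(np.abs(normalised - bin_prob)))
```

The maximum discrepancy shrinks roughly at the 1/sqrt(n) rate expected for binomial proportions, which is one informal way to quantify the difference the question asks about.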
A histogram can suggest the type of probability distribution and its parameters: location, scale, shape, and other properties such as skewness, kurtosis, and so on.
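As a small illustration (the gamma-distributed data and the use of SciPy's summary functions are my assumptions, not part of the remark above), these location/scale/shape summaries can be read off a sample directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=5000)   # hypothetical right-skewed data

print("location (mean):", np.mean(x))
print("scale (std):    ", np.std(x, ddof=1))
print("skewness:       ", stats.skew(x))
print("excess kurtosis:", stats.kurtosis(x))     # Fisher definition: normal -> 0
```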
It is an interesting question. The histogram is based on bins and the frequencies within them. If the bins are made very narrow, we approach the case of near-zero probability (for a continuous random variable, the probability at any given point is zero). The other extreme, a very broad bin, would yield an essentially deterministic variable. From this we may proceed as follows:
1. Choose some "best" bin division which gives a maximally smooth polynomial fit; this polynomial would then be the desired fit.
2. Model the data with some known pdfs, chosen according to the problem, that closely resemble the histogram (see the sketch below). In this case the model pdf with the least squared error would likely yield the result.
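A rough sketch of option 2 (the simulated data, the normal candidate model, and the use of scipy.optimize.curve_fit for the least-squares fit are illustrative assumptions, not the only way to do this):

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=2000)    # hypothetical data

# Normalised histogram and bin mid-points.
heights, edges = np.histogram(x, bins=25, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])

# Candidate model pdf (here a normal); least-squares fit to the bin heights.
def model_pdf(t, mu, sigma):
    return stats.norm.pdf(t, loc=mu, scale=sigma)

(mu_hat, sigma_hat), _ = curve_fit(model_pdf, mids, heights, p0=[np.mean(x), np.std(x)])

# Sum of squared errors of the fitted pdf against the histogram,
# one way to quantify the difference between the two.
sse = np.sum((heights - model_pdf(mids, mu_hat, sigma_hat)) ** 2)
print(mu_hat, sigma_hat, sse)
```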
Your question is not simple, as it depends on the goals and assumptions of the analysis. The suitable distances/divergences between your pdf and sample points can change accordingly, and the answers you got reflect this situation. For instance, if you accept that your (estimated) pdf is the target model and you try to compare it with an independent sample, you are in the situation of goodness-of-fit testing. If your interest is focused on the central frequencies, misfits in the tails are not so important, and your sample size is small to medium (e.g. n = 10, ..., 300), the Kolmogorov-Smirnov test may solve your problem, as suggested by C. J. Albers. The suggestion by P. Teran is related to the failure of the Kolmogorov-Smirnov test when using large samples.
However, if your interest is in low-probability intervals (the tails), the Kolmogorov-Smirnov distance will describe differences very poorly. Then strategies such as the Pearson chi-squared test statistic give a reasonable divergence at the price of defining intervals (bins) for both the interpolated pdf and the sample. The difference computed in this way is more sensitive to departures in the tails. Once the pdf has been discretised on intervals, you can also use the likelihood-ratio (chi-squared) test for multinomial goodness-of-fit, with results similar to those of the Pearson chi-squared test. All of these are approaches pointing to goodness-of-fit testing.
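A sketch of how those three tests could be applied, assuming a standard normal reference pdf, a hypothetical sample, and ten equally probable bins (all numerical choices here are illustrative, not prescribed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)                    # hypothetical sample
ref = stats.norm()                          # reference (target) pdf

# Kolmogorov-Smirnov: compares the empirical cdf with the reference cdf.
ks_stat, ks_p = stats.kstest(x, ref.cdf)

# Pearson chi-squared on binned counts: more sensitive to tail departures.
interior = ref.ppf(np.linspace(0.1, 0.9, 9))          # interior edges of 10 equally probable bins
observed = np.bincount(np.searchsorted(interior, x), minlength=10)
expected = np.full(10, len(x) / 10)
chi2_stat, chi2_p = stats.chisquare(observed, expected)

# Likelihood-ratio (G) statistic for the same multinomial binning.
g_stat, g_p = stats.power_divergence(observed, expected, lambda_="log-likelihood")

print(ks_stat, ks_p, chi2_stat, chi2_p, g_stat, g_p)
```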
Maybe you are not interested in testing but just in a well-founded distance. An easily interpretable approach is the one proposed by Huda: simply compare the first moments of the sample and of the pdf; however, you then get several measures of difference and no trivial way of unifying them. Let me propose the following compositional approach.
Take intervals defined by the quantiles of the pdf so that they are equally probable; with m such intervals, each has probability 1/m under the pdf. Keep the number of intervals moderate (e.g. m = 10, 20, 30). Now compute the relative frequencies of the points of your sample (of size n) in these intervals. Both arrays of frequencies, (1/m, 1/m, ..., 1/m) and (n_1/n, n_2/n, ..., n_m/n), are compositions and live in the unit simplex (their components sum to 1). In the compositional approach a distance, called the "Aitchison distance" (Aitchison 1984), is available. It can easily be computed through the so-called centred log-ratio (clr) coefficients: for a composition x = (x_1, x_2, ..., x_m),
clr(x) = (log(x_1) - k, log(x_2) - k, ..., log(x_m) - k),
where k = (1/m) sum_i log(x_i). The Aitchison distance is then computed as the Euclidean distance between the clr vectors. Note that the clr of (1/m, 1/m, ..., 1/m) is (0, 0, ..., 0).
This kind of distance has interesting theoretical properties. Among them: scale invariance (it does not depend on the units of frequency: proportions, percentages, parts per million, ...); subcompositional coherence (if only a subset of the intervals is considered for both compositions, the Aitchison distance cannot increase); and perturbation invariance (if you rescale individual intervals, e.g. if the entries for the first interval in both compositions are multiplied by 5, the Aitchison distance between the two compositions does not change).
This approach provides a distance which is very sensitive to the presence of low frequencies. Zeros in count data should be avoided by an appropriate estimator of the frequency. It can be used for goodness-of-fit testing, but that requires Monte Carlo simulation.
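A minimal sketch of the compositional comparison described above, assuming equally probable intervals taken from the quantiles of a normal reference pdf and a simple add-0.5 adjustment to avoid zero counts (the zero treatment and all numerical choices are my assumptions, not part of the proposal):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=300)                     # hypothetical sample
ref = stats.norm()                           # reference pdf
m = 10                                       # number of equally probable intervals

# Interior edges of m intervals, each with probability 1/m under the reference pdf.
interior = ref.ppf(np.arange(1, m) / m)
counts = np.bincount(np.searchsorted(interior, x), minlength=m)

# Close to proportions; the +0.5 is one simple way to avoid zero counts.
observed = (counts + 0.5) / (counts + 0.5).sum()
expected = np.full(m, 1.0 / m)

def clr(p):
    """Centred log-ratio transform of a composition."""
    logs = np.log(p)
    return logs - logs.mean()

# Aitchison distance = Euclidean distance between clr vectors;
# clr(expected) is the zero vector, as noted above.
aitchison_dist = np.linalg.norm(clr(observed) - clr(expected))
print(aitchison_dist)
```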
References
Aitchison, J. (1984): Reducing the dimensionality of compositional data sets. Mathematical Geology, 16(6), 617-636.
Egozcue, J. J. and Pawlowsky-Glahn, V. (2011): Basic concepts and procedures. Ch. 2 in: Pawlowsky-Glahn, V. and Buccianti, A. (eds.), Compositional Data Analysis: Theory and Applications. Wiley, Chichester, UK.
Juanjo, when I received a notification that you had posted a remark, I asked myself: well, what could the connection to compositional data possibly be?
1. The physics of the original problem. Some cases, e.g. a one-sided exponential, may have no left-hand tail; this might, for instance, model a crowd already waiting for the opening of a store.
2. It might be possible to fit a linear combination of known pdfs of different types, keeping the properties of a pdf in mind (see the sketch below). This will perhaps give smaller errors, as we have more variables with which to fit the data.
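A sketch of point 2, fitting a non-negative linear combination of candidate pdfs to the normalised histogram (the simulated data, the component dictionary with fixed parameters, and the use of non-negative least squares are illustrative assumptions):

```python
import numpy as np
from scipy import stats
from scipy.optimize import nnls

rng = np.random.default_rng(5)
# Hypothetical data: mostly exponential waiting times plus a normal "bump".
x = np.concatenate([rng.exponential(2.0, 1500), rng.normal(6.0, 1.0, 500)])

heights, edges = np.histogram(x, bins=40, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])

# Candidate component pdfs with fixed parameters (an assumed dictionary, not fitted here).
components = [
    stats.expon(scale=2.0).pdf,
    stats.norm(loc=6.0, scale=1.0).pdf,
    stats.norm(loc=3.0, scale=2.0).pdf,
]
A = np.column_stack([f(mids) for f in components])

# Non-negative least squares keeps every component weight >= 0;
# renormalising the weights makes the combination integrate to one, so it stays a valid pdf.
w, _ = nnls(A, heights)
w = w / w.sum()
fit = A @ w
print("weights:", w, "SSE:", np.sum((heights - fit) ** 2))
```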