In many cases we approximate the histogram of a data set as the probability density function. Can we quantify the difference between the PDF and the histogram? Is there an analytic expression that describes this difference? Thanks.
As the number of samples from which one constructs the histogram approaches infinity, the normalised histogram (in which the frequency in each interval is divided by the total number of samples) approaches the probability density function. For a continuous distribution it is a stepwise approximation: the normalised frequency in an interval approaches the integral of the density function over that interval.
It can be considered a special case of stochastic convergence.
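To make that convergence concrete, here is a minimal sketch (the standard normal distribution, the bin grid, and the sample sizes are illustrative choices of mine, not part of the answer above) comparing the normalised bin frequencies with the integral of the pdf over each bin as the sample grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    counts, edges = np.histogram(x, bins=30, range=(-4.0, 4.0))
    normalised = counts / n                                            # frequency per interval / sample size
    bin_prob = stats.norm.cdf(edges[1:]) - stats.norm.cdf(edges[:-1])  # integral of the pdf over each interval
    print(n, np.max(np.abs(normalised - bin_prob)))
```

The maximum discrepancy shrinks roughly at the 1/sqrt(n) rate expected for binomial proportions, which is one informal way to quantify the difference the question asks about.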
A histogram can suggest the type of probability distribution and its parameters: location, scale, shape, and other properties such as skewness, kurtosis, and so on.
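As a small illustration (the gamma-distributed data and the use of SciPy's summary functions are my assumptions, not part of the remark above), these location/scale/shape summaries can be read off a sample directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=5000)   # hypothetical right-skewed data

print("location (mean):", np.mean(x))
print("scale (std):    ", np.std(x, ddof=1))
print("skewness:       ", stats.skew(x))
print("excess kurtosis:", stats.kurtosis(x))     # Fisher definition: normal -> 0
```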
It is an interesting question. The histogram is based on bins and the frequencies within them. If the bins are made very narrow, we approach the case of near-zero probability (for a continuous random variable, the probability at any given point is zero). The other extreme, a very broad bin, would yield an essentially deterministic variable. From this we may proceed as follows:
1. Choose some "best" bin division which gives a maximally smooth polynomial fit; this polynomial would then be the desired fit.
2. Model the data with some known pdfs, chosen according to the problem, that closely resemble the histogram (see the sketch below). In this case the model pdf with the least squared error would likely yield the result.
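A rough sketch of option 2 (the simulated data, the normal candidate model, and the use of scipy.optimize.curve_fit for the least-squares fit are illustrative assumptions, not the only way to do this):

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=2000)    # hypothetical data

# Normalised histogram and bin mid-points.
heights, edges = np.histogram(x, bins=25, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])

# Candidate model pdf (here a normal); least-squares fit to the bin heights.
def model_pdf(t, mu, sigma):
    return stats.norm.pdf(t, loc=mu, scale=sigma)

(mu_hat, sigma_hat), _ = curve_fit(model_pdf, mids, heights, p0=[np.mean(x), np.std(x)])

# Sum of squared errors of the fitted pdf against the histogram,
# one way to quantify the difference between the two.
sse = np.sum((heights - model_pdf(mids, mu_hat, sigma_hat)) ** 2)
print(mu_hat, sigma_hat, sse)
```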
Your question is not simple, as it depends on the goals and assumptions of the analysis. The suitable distances/divergences between your pdf and sample points can change accordingly, and the answers you got reflect this situation. For instance, if you accept that your (estimated) pdf is the target model and you try to compare it with an independent sample, you are in the situation of goodness-of-fit testing. If your interest is focused on the central frequencies, misfits in the tails are not so important, and your sample size is small to medium (e.g. n = 10, ..., 300), the Kolmogorov-Smirnov test may solve your problem, as suggested by C. J. Albers. The suggestion by P. Teran is related to the failure of the Kolmogorov-Smirnov test when using large samples.
However, if your interest is in low-probability intervals (the tails), the Kolmogorov-Smirnov distance will describe differences very poorly. Then strategies such as the Pearson chi-squared test statistic give a reasonable divergence at the price of defining intervals (bins) for both the interpolated pdf and the sample. The difference computed in this way is more sensitive to departures in the tails. Once the pdf has been discretised on intervals, you can also use the likelihood-ratio (chi-squared) test for multinomial goodness-of-fit, with results similar to those of the Pearson chi-squared test. All of these are approaches pointing to goodness-of-fit testing.
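A sketch of how those three tests could be applied, assuming a standard normal reference pdf, a hypothetical sample, and ten equally probable bins (all numerical choices here are illustrative, not prescribed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)                    # hypothetical sample
ref = stats.norm()                          # reference (target) pdf

# Kolmogorov-Smirnov: compares the empirical cdf with the reference cdf.
ks_stat, ks_p = stats.kstest(x, ref.cdf)

# Pearson chi-squared on binned counts: more sensitive to tail departures.
interior = ref.ppf(np.linspace(0.1, 0.9, 9))          # interior edges of 10 equally probable bins
observed = np.bincount(np.searchsorted(interior, x), minlength=10)
expected = np.full(10, len(x) / 10)
chi2_stat, chi2_p = stats.chisquare(observed, expected)

# Likelihood-ratio (G) statistic for the same multinomial binning.
g_stat, g_p = stats.power_divergence(observed, expected, lambda_="log-likelihood")

print(ks_stat, ks_p, chi2_stat, chi2_p, g_stat, g_p)
```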
Maybe you are not interested in testing but just in a well-founded distance. An easily interpretable approach is the one proposed by Huda: simply compare the first moments of the sample and of the pdf; however, you then get several measures of difference and no trivial way of unifying them. Let me propose the following compositional approach.
Take intervals defined by the quantiles of the pdf so that they are equally probable; with m such intervals, each has probability 1/m under the pdf. Keep the number of intervals moderate (e.g. m = 10, 20, 30). Now compute the relative frequencies of the points of your sample (of size n) in these intervals. Both arrays of frequencies, (1/m, 1/m, ..., 1/m) and (n_1/n, n_2/n, ..., n_m/n), are compositions and live in the unit simplex (their components sum to 1). In the compositional approach a distance, called the "Aitchison distance" (Aitchison 1984), is available. It can easily be computed through the so-called centred log-ratio (clr) coefficients: for a composition x = (x_1, x_2, ..., x_m),
clr(x) = (log(x_1) - k, log(x_2) - k, ..., log(x_m) - k),
where k = (1/m) sum_i log(x_i). The Aitchison distance is then computed as the Euclidean distance between the clr vectors. Note that the clr of (1/m, 1/m, ..., 1/m) is (0, 0, ..., 0).
This kind of distance has interesting theoretical properties. Among them: scale invariance (it does not depend on the units of frequency: proportions, percentages, parts per million, ...); subcompositional coherence (if only a subset of the intervals is considered for both compositions, the Aitchison distance cannot increase); and perturbation invariance (if you rescale individual intervals, e.g. if the entries for the first interval in both compositions are multiplied by 5, the Aitchison distance between the two compositions does not change).
This approach provides a distance which is very sensitive to the presence of low frequencies. Zeros in count data should be avoided by an appropriate estimator of the frequency. It can be used for goodness-of-fit testing, but that requires Monte Carlo simulation.
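A minimal sketch of the compositional comparison described above, assuming equally probable intervals taken from the quantiles of a normal reference pdf and a simple add-0.5 adjustment to avoid zero counts (the zero treatment and all numerical choices are my assumptions, not part of the proposal):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=300)                     # hypothetical sample
ref = stats.norm()                           # reference pdf
m = 10                                       # number of equally probable intervals

# Interior edges of m intervals, each with probability 1/m under the reference pdf.
interior = ref.ppf(np.arange(1, m) / m)
counts = np.bincount(np.searchsorted(interior, x), minlength=m)

# Close to proportions; the +0.5 is one simple way to avoid zero counts.
observed = (counts + 0.5) / (counts + 0.5).sum()
expected = np.full(m, 1.0 / m)

def clr(p):
    """Centred log-ratio transform of a composition."""
    logs = np.log(p)
    return logs - logs.mean()

# Aitchison distance = Euclidean distance between clr vectors;
# clr(expected) is the zero vector, as noted above.
aitchison_dist = np.linalg.norm(clr(observed) - clr(expected))
print(aitchison_dist)
```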
References
Aitchison, J. (1984): Reducing the dimensionality of compositional data sets. Mathematical Geology, 16(6), 617-636.
Egozcue, J. J. and Pawlowsky-Glahn, V. (2011): Basic concepts and procedures. Ch. 2 in: Pawlowsky-Glahn, V. and Buccianti, A. (eds.), Compositional Data Analysis: Theory and Applications. Wiley, Chichester, UK.
Juanjo, when I received a notification that you had posted a remark, I asked myself: well, what could the connection to compositional data possibly be?
1. The physics of the original problem. Some cases, e.g. a one-sided exponential, may have no left-hand tail; this might, for instance, model a crowd already waiting for the opening of a store.
2. It might be possible to fit a linear combination of known pdfs of different types, keeping the properties of a pdf in mind (see the sketch below). This will perhaps give smaller errors, as we have more variables with which to fit the data.
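A sketch of point 2, fitting a non-negative linear combination of candidate pdfs to the normalised histogram (the simulated data, the component dictionary with fixed parameters, and the use of non-negative least squares are illustrative assumptions):

```python
import numpy as np
from scipy import stats
from scipy.optimize import nnls

rng = np.random.default_rng(5)
# Hypothetical data: mostly exponential waiting times plus a normal "bump".
x = np.concatenate([rng.exponential(2.0, 1500), rng.normal(6.0, 1.0, 500)])

heights, edges = np.histogram(x, bins=40, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])

# Candidate component pdfs with fixed parameters (an assumed dictionary, not fitted here).
components = [
    stats.expon(scale=2.0).pdf,
    stats.norm(loc=6.0, scale=1.0).pdf,
    stats.norm(loc=3.0, scale=2.0).pdf,
]
A = np.column_stack([f(mids) for f in components])

# Non-negative least squares keeps every component weight >= 0;
# renormalising the weights makes the combination integrate to one, so it stays a valid pdf.
w, _ = nnls(A, heights)
w = w / w.sum()
fit = A @ w
print("weights:", w, "SSE:", np.sum((heights - fit) ** 2))
```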