Suppose I am looking at millions of data sets, each with millions of data points, and I need to capture details about each of those distributions as accurately as possible. Histograms are a concise way to store this information: a CDF or approximate quantiles can be reconstructed later from the stored histograms, and for large data sets they can be computed efficiently in parallel across many machines.
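To make the setup concrete, here is a minimal Python sketch of how stored histograms might be merged and queried later. The function names, the breakpoints, and the counts are hypothetical, and the quantile step assumes data are spread uniformly within each bin (linear interpolation of the CDF), which is only one possible convention.

```python
import numpy as np

def merge_counts(*partial_counts):
    """Sum per-bin counts computed on separate machines.

    Assumes every partial histogram uses the same breakpoints.
    """
    return np.sum(partial_counts, axis=0)

def approx_cdf_at_edges(counts):
    """Empirical CDF evaluated at each breakpoint (starts at 0, ends at 1)."""
    counts = np.asarray(counts, dtype=float)
    return np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

def approx_quantile(edges, counts, q):
    """Approximate quantile by linear interpolation of the CDF.

    Assumption: mass is uniform inside each bin. Empty bins make the CDF
    flat, so the quantile is not unique there.
    """
    cdf = approx_cdf_at_edges(counts)
    return float(np.interp(q, cdf, edges))

# Hypothetical breakpoints and two partial histograms from different workers.
edges = np.array([0.0, 10.0, 20.0, 30.0])
counts = merge_counts([1, 2, 1], [1, 4, 1])
median = approx_quantile(edges, counts, 0.5)  # falls in the middle bin
```

The merge is just an element-wise sum, which is what makes histograms convenient to compute in parallel.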

What statistical methods best capture the information loss for a given set of histogram breakpoints for a given empirical distribution?

For example, suppose I have the data set 1, 1, 1, 1, 1, 9, 9, 9, 9, 9. Histogram 1 uses breakpoints 0, 5, 10 and Histogram 2 uses breakpoints 0, 2, 4, 6, 8, 10.

So Histogram 1 looks like:

[0,5): 5

[5,10]: 5

Histogram 2 looks like:

[0,2): 5

[2,4): 0

[4,6): 0

[6,8): 0

[8,10]: 5

Clearly Histogram 1 loses more information than Histogram 2: its unfortunate choice of breakpoints hides the bimodal nature of the underlying distribution, whereas the breakpoints of Histogram 2 reveal it.
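For reference, the two histograms can be reproduced with a short Python sketch using `numpy.histogram`. Note that NumPy treats every bin as half-open except the last, which is closed on the right; the variable names are just illustrative.

```python
import numpy as np

data = np.array([1, 1, 1, 1, 1, 9, 9, 9, 9, 9])

# Histogram 1: two wide bins.
counts1, edges1 = np.histogram(data, bins=[0, 5, 10])

# Histogram 2: five narrower bins over the same range.
counts2, edges2 = np.histogram(data, bins=[0, 2, 4, 6, 8, 10])

print(counts1)  # counts per bin for breakpoints 0, 5, 10
print(counts2)  # counts per bin for breakpoints 0, 2, 4, 6, 8, 10
```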

Since I don't know whether the underlying distribution is normal, I am currently using a worst-case metric: it generates the most extreme distributions that could be represented by the same histogram and takes the Kolmogorov-Smirnov statistic between them (or simply the maximum distance between the two CDFs approximated from the histograms, as represented by the yellow boxes in the rightmost column of the attached plots).
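Here is a rough Python sketch of one way to read that worst-case metric; it is my interpretation, not necessarily the exact calculation behind the plots. It assumes the two extreme distributions place all of each bin's mass at the bin's left edge and at its right edge, respectively, so the two CDFs can differ inside a bin by at most that bin's share of the data. The function name is hypothetical.

```python
import numpy as np

def worst_case_ks(counts):
    """Largest possible CDF gap between two distributions sharing a histogram.

    Assumption: the extremes put each bin's mass entirely at its left edge
    (upper CDF envelope) or entirely at its right edge (lower envelope).
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()   # fraction of the data in each bin
    upper = np.cumsum(p)        # CDF inside bin i if its mass sits at the left edge
    lower = upper - p           # CDF inside bin i if its mass sits at the right edge
    return float(np.max(upper - lower))

ks_hist1 = worst_case_ks([5, 5])           # breakpoints 0, 5, 10
ks_hist2 = worst_case_ks([5, 0, 0, 0, 5])  # breakpoints 0, 2, 4, 6, 8, 10
```

Under this reading the bound reduces to the largest bin fraction, so on the toy data above both sets of breakpoints give 0.5; the bin widths do not enter the calculation at all.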

Do any statistical software packages calculate KS or information loss metrics directly from histograms? Are there other methods besides KS which capture this information loss? I couldn't find anything for R on CRAN.
