For example, how can we use the Kullback–Leibler divergence to measure the distance between two predictions, that is, between the different probability distributions underlying each classification, and what conclusions can we draw from that distance?
If you try two methods and want to know which is the best, evaluate them on an independent test set (or perform cross-validation). Calculate a confidence interval for both scores (e.g. if you use the AUROC to evaluate classifiers, see the NIPS paper by Corinna Cortes) and apply an appropriate statistical test (e.g. a t-test) to see whether the difference is significant.
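A minimal sketch of this workflow in Python, assuming scikit-learn and SciPy; the two models, the synthetic dataset and the fold count are placeholders for whatever you are actually comparing:

```python
# Compare two classifiers with cross-validated ROC AUC, rough confidence
# intervals, and a paired t-test on the per-fold scores.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset

clf_a = LogisticRegression(max_iter=1000)
clf_b = RandomForestClassifier(random_state=0)

# ROC AUC on the same 10 stratified folds for both models
scores_a = cross_val_score(clf_a, X, y, scoring="roc_auc", cv=10)
scores_b = cross_val_score(clf_b, X, y, scoring="roc_auc", cv=10)

# Rough 95% confidence interval for each mean score (normal approximation)
for name, s in [("A", scores_a), ("B", scores_b)]:
    half_width = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    print(f"Classifier {name}: AUC = {s.mean():.3f} +/- {half_width:.3f}")

# Paired t-test: is the per-fold difference significantly different from zero?
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")
```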
I don't understand the part of your question about the KL divergence. Maybe you want to use it to compute the deviation between the predicted and the true probability distribution?
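If that is the intended use, a small sketch with SciPy; the two distributions are made-up numbers purely for illustration:

```python
# KL divergence between a "true" class distribution and a predicted one.
import numpy as np
from scipy.stats import entropy

p_true = np.array([0.5, 0.3, 0.2])   # reference (true) distribution
p_pred = np.array([0.4, 0.4, 0.2])   # predicted distribution

kl_pq = entropy(p_true, p_pred)  # D_KL(p_true || p_pred), in nats
kl_qp = entropy(p_pred, p_true)  # note the asymmetry

print(f"D_KL(true || pred) = {kl_pq:.4f}")
print(f"D_KL(pred || true) = {kl_qp:.4f}")
```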
If you are talking about differences in terms of classification capabilities, then you can use k-fold cross-validation. This statistical analysis allows you to assess the capability of a predictive model (created using a machine learning algorithm) to generalize, in other words, the capability of the model to classify unseen cases given an independent data set. You can find a lot of information on this method in the literature. This analysis gives you an idea of how accurate your predictive model is.
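A short sketch of k-fold cross-validation with scikit-learn; the model and the example dataset are placeholders for your own:

```python
# 10-fold cross-validation to estimate how well a model generalizes
# to unseen data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # example dataset
model = SVC(kernel="rbf", gamma="scale")     # example predictive model

scores = cross_val_score(model, X, y, cv=10)  # accuracy on each held-out fold
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```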
If I understand the question correctly, you are asking how one can interpret D_KL as some form of hypothesis test of a difference between probability distributions? As far as I know (and my background is in biology, not statistics), this type of test does not exist.

Apart from anything else, since D_KL is asymmetric, except in special cases, for any two probability distributions there are usually two D_KL values, depending on which distribution is the reference and which is the comparison. Given this, how would you decide which value to use as evidence of a significant difference? In some applications the choice of which distribution to use as the reference is perhaps obvious, such as when one is comparing a theoretical distribution with a set of empirical observations. However, in your case, where you have two classifiers, both of which are likely to be error-prone, there is no a priori reason to choose one as the reference.

If you have a gold-standard classification against which to evaluate your classifiers, you could calculate pair-wise D_KL values between your gold standard and each classifier (see the short sketch at the end of this answer). However, I can foresee two further problems with this. First, as already noted, this isn't a hypothesis test of the existence of a real difference, and, second, it is likely that some classification mistakes are more important than others, so even though D_KL will give you an overall measure of the difference between a reference and a comparison distribution, it is averaged over the distribution and so may blur important details.

Finally, I'd say that unless you have a particular motivation to use an information-theoretic measure of difference, there are many more standard and more easily interpreted approaches than D_KL. Some of the other answers have mentioned possibilities such as bootstrap methods, ROC, cross-validation and so on. I'd explore one or more of those.
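To make the pair-wise D_KL idea and its asymmetry concrete, here is a small sketch; all numbers are invented for illustration, and this remains a descriptive measure, not a hypothesis test:

```python
# Compare each classifier's predicted class distribution to a gold-standard
# distribution, computing D_KL in both directions to show the asymmetry.
import numpy as np
from scipy.special import rel_entr

gold = np.array([0.60, 0.25, 0.15])   # gold-standard class proportions
clf1 = np.array([0.55, 0.30, 0.15])   # classifier 1 predicted proportions
clf2 = np.array([0.70, 0.20, 0.10])   # classifier 2 predicted proportions

def kl(p, q):
    """D_KL(p || q) in nats; p is the reference distribution."""
    return rel_entr(p, q).sum()

for name, q in [("classifier 1", clf1), ("classifier 2", clf2)]:
    print(f"D_KL(gold || {name}) = {kl(gold, q):.4f}   "
          f"D_KL({name} || gold) = {kl(q, gold):.4f}")
```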