Decision trees have the property that they provide both a prediction and a probability for that prediction (scikit-learn's predict_proba method; Section 3.4 of Data Mining with Decision Trees, 2nd edition), which is basically the proportion of the predicted class among the training samples at the corresponding leaf. This probability is more an indication of the tree's (un)certainty about its prediction than a true probability (as density estimation techniques would provide), and it is that uncertainty indication that I am using.
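To make the setup concrete, here is a minimal sketch of what I mean (made-up data; nothing hinges on the particular numbers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data and a small tree, just to have something to look at.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

proba = tree.predict_proba(X)   # one row per sample: class proportions at the leaf it falls into
p_hat = proba.max(axis=1)       # probability attached to the predicted class
print(np.round(p_hat[:5], 3))
```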
Do you know of any theoretical bounds on this quantity? (The minimum value, and the probability distribution between that minimum and 1, would be very interesting to me.) More formally, denoting by $\hat{p}$ the posterior probability of the sample belonging to the predicted class (a random quantity), $\hat{p} = P(\hat{y} \mid x)$, what would be the probability distribution of $\hat{p}$ (or a way to compute $P(\hat{p} \le T)$ for varying $T$)? I guess there must be some approximation involved (there is one value per leaf, i.e. $\hat{p}$ has a discrete distribution over a small subset of the rationals). Something like a generalisation bound based on the VC dimension, for instance (such a bound is available for the predictions, of course, but I do not see how to generalise it to the uncertainty).
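The discreteness is easy to see by looking directly at the leaves of the tree fitted in the snippet above (tree_.value holds per-leaf class counts, or class fractions in recent scikit-learn versions, hence the normalisation):

```python
# Continuing from the snippet above: p_hat can only take one value per leaf,
# so its distribution is discrete over a small set of rationals.
is_leaf = tree.tree_.children_left == -1
leaf_counts = tree.tree_.value[is_leaf, 0, :]                     # per-leaf class counts (or fractions)
leaf_prop = leaf_counts / leaf_counts.sum(axis=1, keepdims=True)  # normalise so either convention works
leaf_p_hat = leaf_prop.max(axis=1)                                # the p_hat attached to each leaf
print(np.unique(np.round(leaf_p_hat, 3)))                         # the few values p_hat can take
```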
I have done quite a lot of research on the topic, but have found nothing of use in my case. Some people advise using things like Platt scaling or isotonic regression, but these seem more suited to density estimation based on the algorithm's output.
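For reference, this is the kind of recalibration people suggest (scikit-learn's CalibratedClassifierCV, with "sigmoid" for Platt scaling or "isotonic" for isotonic regression), continuing with the same toy data as above; it rescales the scores into calibrated probabilities but gives no bound of the kind I am after:

```python
# What is usually suggested: recalibrate the tree's scores with Platt scaling ("sigmoid")
# or isotonic regression, fitted by cross-validation on the training data.
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=4, random_state=0), method="isotonic", cv=5
).fit(X, y)
print(calibrated.predict_proba(X)[:5].max(axis=1))
```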
A very basic lower bound would be $1/K$ for $K$ classes, reached when all classes have the same number of samples at the leaf (neglecting the samples that went to other leaves); the upper bound would be 1 for leaves with no impurity. However, this gives no indication whatsoever of the distribution of values between the two extremes. (I could compute that distribution on one tree, but it is hard to write a proof based on experimental results.)
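To be explicit, this is the experimental version of the quantity, computed on the tree fitted in the first snippet; it tells me nothing about what to expect in general:

```python
# Continuing from the first snippet: an empirical estimate of P(p_hat <= T)
# over the training samples, for a grid of thresholds T.
for T in np.linspace(0.5, 1.0, 6):
    print(f"P(p_hat <= {T:.1f}) ~= {(p_hat <= T).mean():.2f}")
```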
(If you know about any similar result for other machine-learning algorithms, I'm also very interested!)
Thanks a lot for your time!