The primary purpose of information gain is to determine the relevance of an attribute and thus its position in the decision tree. For attributes (variables) with many distinct values, however, information gain fails to discriminate accurately among the candidate attributes.
No, it does not work well for attributes with a large number of distinct values (an overfitting issue) [1]. To address this, the information gain ratio has been proposed; however, it has problems of its own [2]. For a comprehensive review of feature selection methods, their advantages, and their disadvantages, please refer to [3].
Please note that information gain (IG) is biased toward variables with a large number of distinct values, not variables whose observations have large values. Before explaining the reason for this behavior, let's review the definition of IG.
Information gain is the amount of information gained by knowing the value of the attribute: the entropy of the class distribution before the split minus the (weighted) entropy of the distributions after it. The largest information gain therefore corresponds to the smallest post-split entropy.
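Written out in standard notation (the symbols $T$, $A$, $T_v$ here are a generic formulation of this definition, not taken from the cited papers):

$$IG(T, A) = H(T) - \sum_{v \in \mathrm{values}(A)} \frac{|T_v|}{|T|}\, H(T_v), \qquad H(T) = -\sum_{c} p_c \log_2 p_c,$$

where $T$ is the set of training samples, $A$ is the candidate attribute, $T_v$ is the subset of samples with $A = v$, and $p_c$ is the proportion of class $c$ in $T$.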
In other words, the variable with the largest number of distinct values can split the data into the smallest chunks, and a smaller number of observations in each chunk lowers the chance of class variation within it, so each chunk tends to be pure and the post-split entropy shrinks.
Using an ID variable to split the data is a common example of this issue. Since each sample has its own distinct value, splitting on the ID feature produces one partition per sample, each with an entropy of zero. A decision tree driven by IG therefore selects the ID as the first splitting attribute, because the post-split entropy drops to zero. However, we are not interested in such a feature; we want features that actually explain the variation of the dependent variable.
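A minimal sketch of this effect, assuming a tiny made-up dataset (the toy data and helper names below are illustrative, not from any of the cited references):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy before the split minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Toy dataset: 'outlook' is a somewhat informative feature,
# 'ids' is just a unique row identifier.
labels  = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "sunny", "rain", "rain", "rain"]
ids     = [1, 2, 3, 4, 5, 6]

print(information_gain(outlook, labels))  # ~0.08: the informative feature still leaves impurity
print(information_gain(ids, labels))      # 1.0: maximal gain, since every singleton chunk has entropy 0
```

The ID column wins by construction, which is exactly the bias described above; the gain ratio mentioned earlier counteracts it by dividing IG by the split information (the entropy of the partition sizes), which is large for many-valued splits.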