Information gain for text mining?

08 August 2014

Hi Fellow Researchers and Professors,

I am working on feature selection methods for text classification. I'm following the famous Yang and Pedersen (1997) research paper (attached).

I have successfully implemented the Mutual Information and Chi-Square based on the formulas given in the paper in terms of A,B,C,D,N.

The problem arises when we move to Information Gain. Its formula is given in probabilities and not in terms of A,B,C,D,N, and I did not find such a form anywhere. It would be great if you could provide a link to any paper or other resource that might help.

I know it is already implemented in tools like Weka, but due to some other constraints, I'm coding this myself in Java.

I'm sure many of you can help me in this regard. So please do.

Thanks!

Farhan

Jugurta Montalvao

Dear Hassan Khan,

The definition (equation) of G(t), in Section 2.2 of the paper, has 3 terms (say terms A, B and C). All of them are given by weighted sums of log-probabilities.
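
For reference, the definition being discussed can be written out as below. The symbols follow the usual probability notation; the three sums are, in order, the terms A, B and C mentioned above (these labels are unrelated to the document counts A, B, C, D in the question).

```latex
G(t) = -\sum_{i=1}^{m} P(c_i)\,\log P(c_i)
       + P(t)\sum_{i=1}^{m} P(c_i \mid t)\,\log P(c_i \mid t)
       + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\,\log P(c_i \mid \bar{t})
```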

Please recall that 1/(probability of a given symbol), in logarithmic scale, is the amount of information associated with that symbol. Therefore, the weighted sum in term A is the average amount of information associated with the set of all categories of text {c_1, c_2,... c_m}, that is, its entropy.

In other words, term A gives a measure of how hard it is to guess the category of the text without any hint.

Now, whether the term 't' is present in or absent from the text can be seen as a hint that may help in this task of guessing the text category, and this is what terms B and C take into account, respectively.

In very practical terms, all you need is to obtain 3 'histograms':

h1: m bins with relative frequencies of texts in categories 1 to m (note that sum(h1)=1)

h2: the same, but considering only texts where the term 't' appears

h3: again the same, but considering only texts where the term 't' does not appear

You also need the overall relative frequency of texts with (P_t) and without (P_not_t) the term 't'. Plainly speaking, the equation of G(t) is equivalent to:

G = -sum(h1.*log(h1)) + P_t*sum(h2.*log(h2)) + P_not_t*sum(h3.*log(h3))

Obs: Note that h1, h2 and h3 are arrays of m estimated probabilities each, and, in my Matlab-like syntax, the element-wise product [a_1 a_2... a_m].*[b_1 b_2 ... b_m] stands for [(a_1*b_1) (a_2*b_2) ... (a_m*b_m)]. Any bin with probability 0 contributes 0 to the sum (0*log(0) is taken as 0).

Suggestion: for histogram (actually Probability Mass Function) estimation, instead of the usual normalized histograms, I suggest the use of the Laplace model, as explained in the amazing book by D. MacKay, page 117 (http://www.inference.phy.cam.ac.uk/itprnn/book.html). It is equally simple to implement, but a far more robust and realistic model.
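
Since the question mentions Java, here is a minimal sketch of the histogram recipe above. The class and method names are made up for illustration, the natural logarithm is used, and `laplace` assumes that "the Laplace model" means add-one smoothing, (count_i + 1) / (total + m); that reading is an assumption, not something confirmed in this thread.

```java
final class InfoGainFromHistograms {

    /** Returns sum_i p[i] * log(p[i]), treating 0 * log(0) as 0. */
    static double sumPLogP(double[] p) {
        double s = 0.0;
        for (double v : p) {
            if (v > 0.0) {
                s += v * Math.log(v);
            }
        }
        return s;
    }

    /**
     * G = -sum(h1.*log(h1)) + P_t*sum(h2.*log(h2)) + P_not_t*sum(h3.*log(h3)).
     * h1: category frequencies over all texts; h2: over texts containing t;
     * h3: over texts not containing t; pT: fraction of texts containing t.
     */
    static double gain(double[] h1, double[] h2, double[] h3, double pT) {
        double pNotT = 1.0 - pT;
        return -sumPLogP(h1) + pT * sumPLogP(h2) + pNotT * sumPLogP(h3);
    }

    /** Assumed add-one (Laplace) estimate: (count_i + 1) / (total + m). */
    static double[] laplace(long[] counts) {
        long total = 0;
        for (long k : counts) {
            total += k;
        }
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            p[i] = (counts[i] + 1.0) / (total + counts.length);
        }
        return p;
    }
}
```

With two equally likely categories and a term that appears in exactly one of them, `gain(new double[]{0.5, 0.5}, new double[]{1, 0}, new double[]{0, 1}, 0.5)` gives log(2); if h2 and h3 both equal h1, the term carries no hint and the gain is 0.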

Samah Fodeh

I agree with Jugurta, it is a great explanation!

However, if you still need those probabilities in terms of A, B, C, D, you can compute the following for each category:

-((A+C)/N)log((A+C)/N) + (A/(A+B))log(A/(A+B)) + (C/(C+D))log(C/(C+D))

You still need to add up these values over all categories within each term, then multiply the second term by the global relative frequency of the presence of the term t, P(t) = (A+B)/N, and the third term by that of its absence, P(t') = (C+D)/N, where A, B, C, D are global counts across all categories.
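
The same computation can be sketched in Java directly from the counts. This assumes the Yang and Pedersen convention for each category c: A = texts in c containing t, B = texts outside c containing t, C = texts in c without t, D = texts outside c without t, so that A+B+C+D = N for every category. The class and method names are hypothetical, and the natural logarithm is used.

```java
final class InfoGainFromCounts {

    /** x * log(x), with 0 * log(0) taken as 0. */
    static double xLogX(double x) {
        return x > 0.0 ? x * Math.log(x) : 0.0;
    }

    /**
     * a[i], b[i], c[i], d[i] are the A, B, C, D counts of the term for category i.
     * A+B (texts with t) and C+D (texts without t) are the same for every
     * category, so they are read from category 0.
     */
    static double gain(long[] a, long[] b, long[] c, long[] d) {
        double withT = a[0] + b[0];       // number of texts containing t
        double withoutT = c[0] + d[0];    // number of texts not containing t
        double n = withT + withoutT;      // N, total number of texts

        double prior = 0.0;      // sum of P(c) log P(c)
        double givenT = 0.0;     // sum of P(c|t) log P(c|t)
        double givenNotT = 0.0;  // sum of P(c|t') log P(c|t')
        for (int i = 0; i < a.length; i++) {
            prior += xLogX((a[i] + c[i]) / n);
            if (withT > 0) {
                givenT += xLogX(a[i] / withT);
            }
            if (withoutT > 0) {
                givenNotT += xLogX(c[i] / withoutT);
            }
        }
        return -prior + (withT / n) * givenT + (withoutT / n) * givenNotT;
    }
}
```

The guards on `withT` and `withoutT` only avoid a division by zero for a term that occurs in every text or in none; in both cases the corresponding weight P(t) or P(t') is 0 anyway.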

Farhan Hassan Khan

Thanks a lot for your answers. @Samah I was actually looking for this equation. Thanks to you.

Mustapha Bouakkaz

I agree with Jugurta, it is a great explanation!
