Hi,

I need to classify a collection of documents into predefined subjects. The classification is based on TF-IDF. How can I determine whether unigrams, bigrams, trigrams, or higher-order n-grams would be best suited for this? Is there a formal or standard way to determine this?

Also, how can I determine the most appropriate number of features to use?

Any help would be highly appreciated.

Manjula.
