Hi,
I need to classify a collection of documents into predefined subjects. The classification is based on TF-IDF features. How can I determine whether unigrams, bigrams, trigrams, or higher-order n-grams are best suited for this? Is there a formal or standard way to decide?
Also, how can I determine the most appropriate number of features to use?
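
One empirical way to frame both questions is to treat the maximum n-gram length and the number of features as hyperparameters and compare candidate settings by cross-validated accuracy. The following is only a minimal sketch under stated assumptions: the toy documents and labels are made up, features are kept by document frequency, and a nearest-centroid classifier with leave-one-out validation stands in for whatever classifier is actually used. All function names (ngrams, fit_vocab, tfidf, predict, loo_accuracy) are hypothetical helpers written for this illustration.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n_max):
    """Return all word n-grams of length 1..n_max for a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_vocab(train_texts, n_max, k):
    """Keep the k n-grams with the highest document frequency; return their IDF."""
    df = Counter()
    for text in train_texts:
        df.update(set(ngrams(text, n_max)))
    n_docs = len(train_texts)
    top = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    # Smoothed IDF; other weighting variants are equally plausible.
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in top}

def tfidf(text, idf, n_max):
    """L2-normalised sparse TF-IDF vector restricted to the fitted vocabulary."""
    tf = Counter(t for t in ngrams(text, n_max) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def predict(train_vecs, train_labels, vec):
    """Assign the label whose summed class vector is most cosine-similar to vec."""
    centroids = defaultdict(Counter)
    for v, label in zip(train_vecs, train_labels):
        centroids[label].update(v)  # adds the weights term by term

    def score(label):
        c = centroids[label]
        norm = math.sqrt(sum(x * x for x in c.values())) or 1.0
        return sum(w * c[t] for t, w in vec.items()) / norm

    return max(sorted(centroids), key=score)

def loo_accuracy(texts, labels, n_max, k):
    """Leave-one-out accuracy; vocabulary and IDF are refit on each training fold."""
    correct = 0
    for i in range(len(texts)):
        train_texts = texts[:i] + texts[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        idf = fit_vocab(train_texts, n_max, k)
        train_vecs = [tfidf(t, idf, n_max) for t in train_texts]
        pred = predict(train_vecs, train_labels, tfidf(texts[i], idf, n_max))
        correct += pred == labels[i]
    return correct / len(texts)

# Made-up toy corpus, for illustration only.
texts = [
    "the stock market fell sharply today",
    "investors sold shares as the market dropped",
    "the central bank raised interest rates",
    "the team won the football match",
    "the striker scored two goals in the match",
    "the coach praised the team after the game",
]
labels = ["finance", "finance", "finance", "sport", "sport", "sport"]

# Compare every (max n-gram length, feature count) pair by held-out accuracy.
results = {(n, k): loo_accuracy(texts, labels, n, k)
           for n in (1, 2, 3) for k in (10, 50, 200)}
best = max(results, key=results.get)
print(best, results[best])
```

On a real collection the same loop would be run with k-fold cross-validation and the actual classifier, and the setting with the best held-out score would be kept; a toy corpus this small says nothing about which setting is better in general.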
Any help would be highly appreciated.
Manjula.