I work on several projects involving supervised learning on text and am always looking for useful features. Some that I have found helpful:
- the words themselves: unigrams, bigrams, trigrams
- percentage of unique words (unique words / total words)
- readability indices
- vocabulary richness indices
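
To make the list concrete, here is a rough sketch of how features like these could be computed with only the Python standard library. The tokenizer and sentence splitter are naive regex-based assumptions, the Automated Readability Index stands in as just one example of a readability index, and the hapax ratio (words occurring exactly once) stands in as one example of a vocabulary richness index. All function names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    # Sliding window of n consecutive tokens, returned as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_word_ratio(tokens):
    # Fraction of tokens that are distinct word types.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_ratio(tokens):
    # One simple vocabulary richness measure: share of tokens whose
    # word type occurs exactly once in the text.
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens) if tokens else 0.0


def automated_readability_index(text):
    # ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
    # Approximate: characters are counted from the naive tokens above,
    # and sentences are split on ., ! and ? only.
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not tokens or not sentences:
        return 0.0
    chars = sum(len(t) for t in tokens)
    return 4.71 * chars / len(tokens) + 0.5 * len(tokens) / len(sentences) - 21.43


def extract_features(text):
    # Bundle everything into one dict per document.
    tokens = tokenize(text)
    return {
        "unigrams": Counter(ngrams(tokens, 1)),
        "bigrams": Counter(ngrams(tokens, 2)),
        "trigrams": Counter(ngrams(tokens, 3)),
        "unique_word_ratio": unique_word_ratio(tokens),
        "hapax_ratio": hapax_ratio(tokens),
        "ari": automated_readability_index(text),
    }


features = extract_features("The cat sat. The cat ran!")
print(features["unique_word_ratio"], features["bigrams"].most_common(1))
```

The n-gram counts would normally be turned into a sparse document-term matrix before going into a classifier, while the ratios and indices can be appended as dense columns.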
Would anyone be willing to share other text features that they have found useful for supervised learning?