I have collected the comments for several products on Amazon, and I want to apply an LDA model to them. First, I have to construct the word-document matrix, but I am struggling with it because I find it hard to select the words. Although I am a statistics student and have taught myself Java, I don't know how to remove punctuation and symbols such as "," and "%%", or how to remove stopwords. Can anyone recommend some books or other materials? Thank you!
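
To make the task concrete, the preprocessing described above (strip punctuation and symbols, drop stopwords, count words per document) could be sketched in plain Java roughly as follows. This is only a minimal sketch under stated assumptions: the class name `TermDocumentMatrix`, the tiny stopword list, and the two sample comments are all hypothetical, and the regex keeps only the letters a-z, so digits are dropped and a word like "don't" is split in two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class TermDocumentMatrix {
    // Tiny illustrative stopword list (an assumption); a real list would be much longer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "it", "and", "but", "this", "i", "to", "of"));

    // Lowercase the text, turn every run of non-letters into one space,
    // split on spaces, and drop stopwords.
    static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        for (String token : cleaned.split(" ")) {
            if (!STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Sorted list of every distinct word that survives tokenization.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // counts[i][j] = number of times vocabulary word i occurs in document j.
    static int[][] matrix(List<String> docs, List<String> vocab) {
        int[][] counts = new int[vocab.size()][docs.size()];
        for (int j = 0; j < docs.size(); j++) {
            for (String token : tokenize(docs.get(j))) {
                counts[vocab.indexOf(token)][j]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample comments, not real data.
        List<String> docs = Arrays.asList(
                "Great product, great price!",
                "The price is too high.");
        List<String> vocab = vocabulary(docs);
        int[][] counts = matrix(docs, vocab);
        for (int i = 0; i < vocab.size(); i++) {
            System.out.println(vocab.get(i) + " " + Arrays.toString(counts[i]));
        }
    }
}
```

The `vocab.indexOf` lookup is linear and only reasonable for small vocabularies; for a real review corpus a `Map<String, Integer>` index and a sparse count structure would likely be a better fit.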