I think we have exchanged ideas on this before. As a first step, you may want to build a term-document matrix after suitable preprocessing (e.g. lowercasing, stop-word removal, stemming). As a baseline approach, you can then filter terms by their term weights or by their Fisher score values. For ready-made implementations, you can use the 'tm' and 'FSelector' packages in R.
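
To make the baseline concrete, here is a minimal sketch of the two steps in plain Python (standard library only), not the 'tm' or 'FSelector' API. It assumes raw term counts as the weights and a common form of the Fisher score (between-class scatter of a term's counts divided by the within-class scatter). The function names `tokenize`, `term_document_matrix`, `fisher_scores` and `top_k_terms` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple preprocessing: lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the count of vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def fisher_scores(matrix, labels):
    """Fisher score per term (column).

    Assumed definition: sum_c n_c * (mu_c - mu)^2 / sum_c n_c * var_c,
    where c ranges over the class labels.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    classes = sorted(set(labels))
    scores = []
    for j in range(n_terms):
        col = [row[j] for row in matrix]
        mu = sum(col) / n_docs
        between = 0.0
        within = 0.0
        for c in classes:
            vals = [v for v, y in zip(col, labels) if y == c]
            n_c = len(vals)
            mu_c = sum(vals) / n_c
            var_c = sum((v - mu_c) ** 2 for v in vals) / n_c
            between += n_c * (mu_c - mu) ** 2
            within += n_c * var_c
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfectly separating terms get +inf,
            # constant terms get 0 (a convention assumed for this sketch).
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_terms(vocab, scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(zip(vocab, scores), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy documents and labels, made up purely for illustration.
    docs = [
        "cheap pills buy now",
        "buy cheap watches",
        "meeting agenda attached",
        "project meeting notes",
    ]
    labels = [1, 1, 0, 0]
    vocab, matrix = term_document_matrix(docs)
    print(top_k_terms(vocab, fisher_scores(matrix, labels), 3))
```

In practice you would probably swap the raw counts for TF-IDF weights and pick the cutoff k by cross-validation; the filtering logic stays the same.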