12 December 2014

I want to use the 20 Newsgroups dataset to test an algorithm and analyse the significant words for each group.

I found the 20 Newsgroups dataset on two websites.

The first is hosted by the University of Toronto (http://www.cs.toronto.edu/~larocheh/public/datasets/20newsgroups/20newsgroups_train_binary_5000_voc.txt). However, I can't find the corresponding vocabulary file for this dataset.

The second is the official website (http://qwone.com/~jason/20Newsgroups/). It contains the necessary 20 Newsgroups data and the vocabulary, but I have to exclude the stopwords and replace the words with their stems by hand. I downloaded an english.stopword file and used it to filter the 20 Newsgroups words, but the dataset still contains a lot of tokens such as 'mg', 'huh', 'sgi', ...
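For reference, the filtering I am doing by hand looks roughly like the sketch below. It is only a minimal illustration, not the exact procedure: the tiny STOPWORDS set stands in for the english.stopword file, and the names build_vocabulary, min_len, min_doc_freq and stem are hypothetical. The document-frequency cutoff is an assumption about how leftover tokens like 'mg' or 'huh' might be dropped; frequent ones such as 'sgi' could still survive in the real data.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list such as the
# english.stopword file mentioned above would be much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, stopwords=STOPWORDS, min_len=3, min_doc_freq=2, stem=None):
    """Return a sorted vocabulary after stopword, length and rarity filtering.

    stem is an optional callable mapping a word to its stem (hypothetical
    hook; no stemming is applied when it is None).
    """
    doc_freq = Counter()
    for doc in docs:
        seen = set()  # count each word at most once per document
        for tok in tokenize(doc):
            if tok in stopwords or len(tok) < min_len:
                continue
            seen.add(stem(tok) if stem else tok)
        doc_freq.update(seen)
    # Keep only words that occur in at least min_doc_freq documents.
    return sorted(w for w, n in doc_freq.items() if n >= min_doc_freq)


docs = [
    "The graphics card is fast",
    "Graphics drivers and the card",
    "mg huh sgi",
]
vocab = build_vocabulary(docs)
print(vocab)
```

In this toy example only the words shared by at least two documents are kept. A stemmer from an external library could be passed in as stem, which would replace the by-hand stemming step.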

So, can anyone shed some light on this? Where can I find the corresponding vocabulary for the first option?
