The classic approach to term filtering in information retrieval is to filter out terms at both ends. Terms that are very infrequent (rare terms) do not contribute much to concept-driven language patterns and are typically removed from the analysis through some kind of frequency filtering (e.g., remove all words that appear in fewer than 5 documents in the collection). This technique typically brings the total number of terms down to something like 5,000-10,000. Similarly, terms that are very frequent (trivial terms) are removed as stopwords. There are usually about 500 of these, which reduces the vocabulary to roughly 4,500-9,500 terms. A third technique is term stemming, where words with a common root are conflated; this may reduce the vocabulary to about 2,000-6,000 stemmed terms. More drastic reductions (down to 200-1,000 terms) require identifying the most USEFUL terms, which is a harder problem. One approach is to use TF-IDF or a similar weighting function (e.g., Log-Entropy). Another approach is to identify terms that are highly associated with the high-order principal components after a preliminary run of Latent Semantic Analysis.
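To make the pipeline concrete, here is a minimal sketch in plain Python; the toy corpus, stopword list, and thresholds are made up for illustration. It applies a document-frequency cutoff and stopword removal, then ranks the surviving terms by a summed TF-IDF score as one way to support the more drastic reduction. Stemming is omitted here; in practice you would first conflate forms like "dog"/"dogs" with, e.g., a Porter stemmer.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of lowercased tokens (assumed preprocessing).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["cats", "and", "dogs", "are", "pets"],
]

stopwords = {"the", "on", "and", "are"}   # tiny illustrative stopword list
min_df = 2                                # drop terms appearing in fewer than 2 documents

# Document frequency: number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))

# Frequency filtering plus stopword removal.
vocab = {t for t, n in df.items() if n >= min_df and t not in stopwords}

# Rank the surviving terms by a simple TF-IDF score summed over documents,
# one way to pick the most USEFUL terms for a drastic reduction.
N = len(docs)
scores = Counter()
for doc in docs:
    tf = Counter(t for t in doc if t in vocab)
    for t, f in tf.items():
        scores[t] += f * math.log(N / df[t])

top_terms = [t for t, _ in scores.most_common(200)]   # keep e.g. the top 200 terms
print(sorted(vocab), top_terms)
```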
I think it depends on the goal you are pursuing. If it is, for example, text compression, I would keep the most frequent words, since those yield the greatest compression gains. If your goal is, say, document clustering, I'd build a TF-IDF-based vocabulary, since those weights make it possible to define a meaningful "similarity" between documents.
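As a small hand-rolled sketch (toy documents, no library dependencies, function names purely illustrative) of how TF-IDF weights give you a similarity measure you could feed into a clustering algorithm:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF weight vectors (dicts) for tokenized documents."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: f * math.log(N / df[t]) for t, f in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "apple"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))   # similarity between the first and third documents
```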
You have not mentioned the purpose of the vocabulary. If you are designing some kind of training set, then it is worthwhile to keep only the valuable words; if you are building something like a hacking-related program, you have to go with the complete vocabulary, including the valuable words, the most frequent ones, and so on.
Since you tagged the question with "Information Retrieval", I assume that is your purpose. There is a trade-off between descriptive words (frequent and meaningful) and discriminative words (infrequent but meaningful).
The notion of meaningfulness is quite important: words in closed grammatical classes (prepositions, articles, adverbs) carry much less meaning and are usually discarded (stop words).
TF-IDF tries to model these two concepts: within a document, frequent words are weighted higher; however, if those words are frequent across all documents, their weight is lowered.
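For concreteness, one common formulation (there are several variants) is:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. The logarithm goes to zero as a term appears in more and more documents, which is exactly the down-weighting described above.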
Even if the information retrieval purpose is clear, you still need to think about whether you favour precision or recall, and the vocabulary choice may have an impact on that.
Finally, I would suggest performing some experiments on a training set and then deciding what works best for your use case.
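If you happen to work in Python, one quick way to run such an experiment is to sweep a few vocabulary-filtering settings and compare a simple quality score; the corpus, thresholds, and the choice of k-means with a silhouette score below are only illustrative assumptions, not a prescription:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical training corpus: a list of raw text documents.
corpus = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog chased the ball",
    "stock prices fell sharply today",
    "stock prices rallied after the announcement",
    "the market rallied on the news",
]

# Try a few vocabulary-filtering settings and compare a simple clustering score.
for min_df, max_df in [(1, 1.0), (1, 0.8), (2, 0.8)]:
    vec = TfidfVectorizer(min_df=min_df, max_df=max_df, stop_words="english")
    X = vec.fit_transform(corpus)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(min_df, max_df, len(vec.vocabulary_), silhouette_score(X, labels))
```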