Does anyone know of any existing stop-word vocabularies?
I am doing some keyword text mining and was wondering whether there are existing stop-word vocabularies I can use to reduce noise in the data.
You can find stop-word lists for several languages on the Snowball stemmer site: http://snowball.tartarus.org
For example, for French: http://snowball.tartarus.org/algorithms/french/stop.txt
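If it helps, here is a minimal sketch of reading a list in that layout, where each line holds a word and anything after a `|` is a comment. The function name `parse_snowball_stoplist` and the sample text are made up for illustration; the sample is not the real file, and downloading and decoding the actual stop.txt is left out.

```python
def parse_snowball_stoplist(text):
    """Collect the words from a 'word | comment' style stop list."""
    words = set()
    for line in text.splitlines():
        # Keep only what comes before the first '|' (the comment marker).
        entry = line.split("|", 1)[0].strip()
        if entry:
            # A line may carry more than one word, so split on whitespace.
            words.update(entry.split())
    return words


# Hypothetical excerpt written in the same layout, not the real file.
sample = """\
 | an illustrative excerpt
au             |  a + le
aux            |  a + les

avec           |  with
"""

french_stopwords = parse_snowball_stoplist(sample)
print(sorted(french_stopwords))
```

Returning a set keeps membership checks cheap when you filter tokens later.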
Also, you can generate stop-word lists from frequency lists extracted from corpora. Wiktionary offers a number of lists in various languages extracted from different sources.
Various good answers have already been given. I would like to add that stop words are often domain-specific, so it may be a good idea to build a stop-word list for your own domain, and there are a few basic approaches for doing so.
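One basic approach, sketched here under my own assumptions, is document frequency: treat a word as a stop-word candidate if it appears in most documents of your domain corpus. The tokenizer, the `min_doc_ratio` threshold and the toy documents are illustrative choices, not a standard recipe, and the candidates should be reviewed by hand.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def candidate_stopwords(documents, min_doc_ratio=0.8):
    """Return words that occur in at least `min_doc_ratio` of the documents."""
    if not documents:
        return set()
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(documents)
    return {w for w, df in doc_freq.items() if df / n_docs >= min_doc_ratio}


# Toy corpus, made up for illustration.
docs = [
    "The patient was given the drug",
    "The drug reduced the fever",
    "The patient recovered",
]
print(candidate_stopwords(docs))
```

Lowering `min_doc_ratio` pulls in more words, including domain terms such as "patient" here, which may or may not be noise for your task.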
With Python's NLTK, the “stopwords” corpus provides common stop words for 11 languages. NLTK is a leading open-source platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
See Bird S., Loper E. and Klein E., Natural Language Processing with Python. O’Reilly Media Inc. (2009)
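A minimal sketch of using that corpus to filter tokens. It assumes NLTK is installed and the corpus has been fetched once with `nltk.download("stopwords")`. The helper names `load_stopwords` and `remove_stopwords` and the small fallback set are my own additions so the snippet still runs without NLTK.

```python
def load_stopwords(language="english", fallback=("the", "a", "of")):
    """Return NLTK's stop words for `language`, or `fallback` if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError, OSError):
        # ImportError: nltk is not installed.
        # LookupError: the stopwords corpus has not been downloaded.
        # OSError: typically raised when no list exists for that language name.
        return set(fallback)


def remove_stopwords(tokens, stop_set):
    """Drop tokens whose lowercase form is in `stop_set`."""
    return [t for t in tokens if t.lower() not in stop_set]


english = load_stopwords("english", fallback={"the", "is", "a", "of"})
tokens = "The mining of keywords is a noisy task".split()
print(remove_stopwords(tokens, english))
```

`stopwords.fileids()` lists the languages available in your installed copy of the corpus.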