Stopwords are the most commonly used words in the language you are working with. In a typical text document, 20% to 30% of the words are stopwords. Stopwords can cause problems when searching for phrases that include them, so they are filtered out when we process the text. The density of a word is calculated as below:
Density(word) = (frequency(word) / total_words) * 100
frequency(word) :- number of occurrences of the word in the document
total_words :- total number of words in the document
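As a rough illustration, here is a minimal Python sketch of that formula; the sample sentence and the word being measured ("the") are made-up examples, not from the original answer:

```python
# Minimal sketch of the density formula above, on a made-up sample text.
text = "the quick brown fox jumps over the lazy dog the end"

words = text.lower().split()      # naive whitespace tokenization
total_words = len(words)          # total number of words in the document
frequency = words.count("the")    # number of occurrences of the word "the"

density = (frequency / total_words) * 100
print(f"density of 'the': {density:.1f}%")   # 3 of 11 words -> ~27.3%
```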
Token count is the number of tokens in the text, i.e. the number of basic units you have defined for your language; in most cases the token is a word.
Any word consisting of only 1 or 2 characters usually won't be of any significance, so we remove all of them.
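Here is a small Python sketch combining those two points: counting tokens and dropping 1- or 2-character words. The regex tokenizer and the sample sentence are my own assumptions; any tokenizer suited to your language works the same way.

```python
import re

# Count tokens, then drop tokens of 1 or 2 characters (made-up sample text).
text = "NLP is fun, but it is also a lot of work."

tokens = re.findall(r"[A-Za-z']+", text.lower())   # basic word tokens
print("token count:", len(tokens))                 # 11 tokens

filtered = [t for t in tokens if len(t) > 2]       # remove 1- and 2-char words
print("after removing short words:", filtered)     # ['nlp', 'fun', 'but', ...]
```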
To remove stopwords, we first need to detect the language. There are a couple of ways we can do this:
- Checking the Content-Language HTTP header
- Checking the lang="" or xml:lang="" attribute
- Checking the Language and Content-Language metadata tags
If none of those are set, you will have to detect the language from the text itself.
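If you do have to fall back on the text itself, one option (my suggestion, not part of the steps above) is a statistical language detector such as the langdetect package:

```python
# Fallback when no language metadata is available: guess the language from
# the text with langdetect (pip install langdetect). This library choice is
# an assumption; any language-identification tool would do.
from langdetect import detect

print(detect("War doesn't show who's right, just who's left."))  # -> 'en'
print(detect("Ein, zwei, drei, vier"))                           # -> 'de'
```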
You will need a list of stopwords per language, which can be easily found on the web.
http://www.ranks.nl/resources/stopwords.html
Try doing it in Python or R; it will be easier for you.
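For the Python route, here is a minimal sketch of stopword filtering. NLTK's built-in English stopword list is used purely for convenience (my assumption); a list downloaded from the link above, or any per-language list, would work the same way.

```python
# Minimal stopword-filtering sketch; the stopword list source is an assumption.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)         # fetch the list once
stop_words = set(stopwords.words("english"))

text = "this is a sample sentence showing off stopword filtration"
tokens = text.split()

filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['sample', 'sentence', 'showing', 'stopword', 'filtration']
```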
I'd like to respectfully complicate the way we're talking about stopwords. What constitutes a stopword depends on the corpus you are working with and the goal of your research. If my approach is primarily semantic, and I want to index what texts are about, then there is a good chance that many of the very common words in my corpus -- what Hope & Witmore (2010) call the "gloop" of language -- aren't of interest, and I want to filter them out. In that case, I may want to filter out the gloop so I can find the "plums." But if I am doing pragmatic work, for example trying to detect latencies in text such as affective or epistemic stance, then the gloop may be critical to my questions. Biber, Conrad, & Reppen (2000) point out that human readers naturally detect plums and ignore gloop, and thus machine-based corpus approaches have an important advantage in doing truly accurate empirical work. I looked at the stopword lists Mayur pointed to, and for the kind of work I do, almost every one of those words has some important function, either alone or in a larger lexical bundle, that I want to detect and count.
Ultimately, you will need to decide on a case-by-case basis what is and is not of interest in your text analysis. I think this issue reflects larger divisions between linguists and computer scientists in our theories and assumptions about language use and how it can be investigated.
Stopword density analyses the words that are repeated most often in your text. A high density of such words can confuse search engines about which keyword a page should be associated with.
Token count returns the number of tokens (the smallest units of the text, usually words) in your text.