Keyness of tokens/words should not be reduced to simple frequency statistics; more sophisticated measures are required, such as comparison against a reference corpus.
In my opinion, that is correct. Plain token frequency is not only statistically naïve, it can also be misleading: in English, words such as "and" or "the" appear very frequently although they carry little semantic content.
This is the main drawback of the so-called "bag-of-words" model, where each document is mapped to a feature vector containing the number of occurrences of each word in that document. Many bag-of-words pipelines, however, simply discard such stopwords.
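As an illustration, here is a minimal bag-of-words sketch in Python; the toy documents and stopword list are invented for the example:

```python
from collections import Counter

# Toy corpus and stopword list, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat and the dog barked",
]
stopwords = {"the", "and", "on"}

def bag_of_words(text, remove_stopwords=True):
    """Map a document to a word -> count feature vector."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return Counter(tokens)

for doc in documents:
    print(bag_of_words(doc))
# First document yields Counter({'cat': 1, 'sat': 1, 'mat': 1})
```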
To overcome these problems, more refined weightings have been proposed, such as TF-IDF, where the importance of each word is given by the product of two terms: the Term Frequency (TF) and the Inverse Document Frequency (IDF).
TF is the number of times a given word appears in a given document (as in the plain bag-of-words count): the higher the TF, the more important the word is within that document.
IDF, instead, measures (in plain terms) how rare a given word is across all the documents under analysis: words that occur in almost every document get a low IDF, while words confined to a few documents get a high IDF.
So the idea behind TF-IDF is to give importance to words that are frequent within a document, but to weight that importance by how common the word is across the whole collection. Words like "and" or "the" have a high TF but a very low IDF, so their overall TF-IDF score ends up rather low.
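To make the weighting concrete, here is a minimal TF-IDF sketch in Python using the common log-based IDF variant (the exact formula differs between implementations, and the toy corpus is invented for the example):

```python
import math

# Toy corpus, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked at the mailman",
]

tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

def tf(term, tokens):
    """Term Frequency: raw count of the term within one document."""
    return tokens.count(term)

def idf(term):
    """Inverse Document Frequency: log(N / number of documents containing the term)."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

# "the" occurs in every document, so its IDF (and hence its TF-IDF) is 0;
# "mailman" occurs in only one document and gets a higher weight.
for term in ("the", "cat", "mailman"):
    print(term, [round(tf_idf(term, tokens), 3) for tokens in tokenized])
```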
I agree with you that it is naive to base the keyness of words simply on their frequency of occurrence; keyness should rather be based on a word's importance in context. That said, frequency can often signal emphasis, and therefore importance. There are also words functioning as adjectives and adverbs whose frequency does not translate into keyness.
You need to complement the 'order-independent' statistics (the simple frequency count) with a correlation metric that takes into account the association between words: this will allow you to shift from 'pure frequency' to detecting words that operate as 'hubs'.
You can do that in many ways. One is to look for clustering: you select a distance metric (the simplest one is the number of words separating an occurrence of word A from an occurrence of word B) and generate word clusters, preferably after selecting in advance a meaningful set of words on which to base the analysis. The generated clusters represent subsets of words that tend to appear together. Those clusters roughly correspond to parts of the text dealing with a specific concept, and the most central word of each cluster is a 'keyword' acting as a 'hub' for that semantic domain.
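As a rough sketch of this idea, the following Python snippet computes token distances between a hand-picked set of target words, groups words that occur close to each other, and picks the most central word of each group as the 'hub'. The toy text, the target words, and the distance threshold are all assumptions made for illustration:

```python
from itertools import combinations

# Toy text and hand-picked target words, for illustration only.
text = ("the neural network was trained and the network weights converged quickly "
        "while the survey collected responses about habits and the survey questions "
        "measured habits over time").split()
targets = ["network", "weights", "survey", "habits"]

def positions(word):
    return [i for i, tok in enumerate(text) if tok == word]

def distance(a, b):
    """Smallest number of tokens separating an occurrence of a from an occurrence of b."""
    return min(abs(i - j) for i in positions(a) for j in positions(b))

# Pairwise distances between the target words.
dist = {(a, b): distance(a, b) for a, b in combinations(targets, 2)}

def pair_dist(a, b):
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

# Simple single-linkage grouping: a word joins the first cluster that already
# contains a word closer than the threshold, otherwise it starts a new cluster.
THRESHOLD = 4
clusters = []
for word in targets:
    for cluster in clusters:
        if any(pair_dist(word, w) <= THRESHOLD for w in cluster):
            cluster.append(word)
            break
    else:
        clusters.append([word])

# The 'hub' of a cluster is the word with the smallest average distance to the others.
for cluster in clusters:
    if len(cluster) > 1:
        hub = min(cluster, key=lambda w: sum(pair_dist(w, o) for o in cluster if o != w)
                  / (len(cluster) - 1))
    else:
        hub = cluster[0]
    print(cluster, "-> hub:", hub)
```

On this toy text the snippet yields two groups, roughly {'network', 'weights'} and {'survey', 'habits'}, each corresponding to one of the two topics in the text; real analyses would of course use larger windows, better distance measures, and proper clustering algorithms.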
The answer to this question, like the answer to most, is that it depends. I would not say that frequency counts are naive, but I suggest that their adequacy depends on your having a reasonable method of determining the importance of a token, assuming that importance varies. If you do not know the importance, do not have a reasonably valid method of determining it, or if there is no difference in importance between tokens, a simple frequency count would be preferred. If you are able to determine importance, that measure could be used to improve your results, for example by using importance as a covariate or as a factor in a multi-factor analysis.