Keyness of tokens/words should not be reduced to simple frequency statistics; more sophisticated measures are required, such as comparison against a reference corpus.
In my opinion, that is correct. Plain token frequency is not only statistically naïve, it can also be misleading: in English, words such as "and" or "the" appear very frequently although they carry little semantic content.
This is the main drawback of the so-called "bag-of-words" model, where each document is mapped to a feature vector containing the number of occurrences of each word in that document. Many bag-of-words pipelines, however, simply discard such stopwords.
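As an illustration, here is a minimal bag-of-words sketch in Python; the toy documents and stopword list are invented for the example:

```python
from collections import Counter

# Toy corpus and stopword list, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat and the dog barked",
]
stopwords = {"the", "and", "on"}

def bag_of_words(text, remove_stopwords=True):
    """Map a document to a word -> count feature vector."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return Counter(tokens)

for doc in documents:
    print(bag_of_words(doc))
# First document yields Counter({'cat': 1, 'sat': 1, 'mat': 1})
```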
To overcome these problems, more refined weightings have been proposed, such as TF-IDF, where the importance of each word is given by the product of two terms: the Term Frequency (TF) and the Inverse Document Frequency (IDF).
TF is the number of times a given word appears in a given document (as in the plain bag-of-words count): the higher the TF, the more important the word is within that document.
IDF, instead, measures (in plain terms) how rare a given word is across all the documents under analysis: words that occur in almost every document get a low IDF, while words confined to a few documents get a high IDF.
So the idea behind TF-IDF is to give importance to words that are frequent within a document, but to weight that importance by how common the word is across the whole collection. Words like "and" or "the" have a high TF but a very low IDF, so their overall TF-IDF score ends up rather low.
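To make the weighting concrete, here is a minimal TF-IDF sketch in Python using the common log-based IDF variant (the exact formula differs between implementations, and the toy corpus is invented for the example):

```python
import math

# Toy corpus, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked at the mailman",
]

tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

def tf(term, tokens):
    """Term Frequency: raw count of the term within one document."""
    return tokens.count(term)

def idf(term):
    """Inverse Document Frequency: log(N / number of documents containing the term)."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

# "the" occurs in every document, so its IDF (and hence its TF-IDF) is 0;
# "mailman" occurs in only one document and gets a higher weight.
for term in ("the", "cat", "mailman"):
    print(term, [round(tf_idf(term, tokens), 3) for tokens in tokenized])
```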
I agree with you that it is naive to base the keyness of words simply on their frequency of occurrence; keyness should rather be based on a word's importance in context. That said, frequency can often signal emphasis, and therefore importance. There are also words functioning as adjectives and adverbs whose frequency does not translate into keyness.
You need to complement the 'order-independent' statistics (the simple frequency count) with a correlation metric that takes into account the association between words: this will allow you to shift from 'pure frequency' to detecting words that operate as 'hubs'.
You can do that in many ways. One is to look for clustering: you select a distance metric (the simplest one is the number of words separating an occurrence of word A from an occurrence of word B) and generate word clusters, preferably after selecting in advance a meaningful set of words on which to base the analysis. The generated clusters represent subsets of words that tend to appear together. Those clusters roughly correspond to parts of the text dealing with a specific concept, and the most central word of each cluster is a 'keyword' acting as a 'hub' for that semantic domain.
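As a rough sketch of this idea, the following Python snippet computes token distances between a hand-picked set of target words, groups words that occur close to each other, and picks the most central word of each group as the 'hub'. The toy text, the target words, and the distance threshold are all assumptions made for illustration:

```python
from itertools import combinations

# Toy text and hand-picked target words, for illustration only.
text = ("the neural network was trained and the network weights converged quickly "
        "while the survey collected responses about habits and the survey questions "
        "measured habits over time").split()
targets = ["network", "weights", "survey", "habits"]

def positions(word):
    return [i for i, tok in enumerate(text) if tok == word]

def distance(a, b):
    """Smallest number of tokens separating an occurrence of a from an occurrence of b."""
    return min(abs(i - j) for i in positions(a) for j in positions(b))

# Pairwise distances between the target words.
dist = {(a, b): distance(a, b) for a, b in combinations(targets, 2)}

def pair_dist(a, b):
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

# Simple single-linkage grouping: a word joins the first cluster that already
# contains a word closer than the threshold, otherwise it starts a new cluster.
THRESHOLD = 4
clusters = []
for word in targets:
    for cluster in clusters:
        if any(pair_dist(word, w) <= THRESHOLD for w in cluster):
            cluster.append(word)
            break
    else:
        clusters.append([word])

# The 'hub' of a cluster is the word with the smallest average distance to the others.
for cluster in clusters:
    if len(cluster) > 1:
        hub = min(cluster, key=lambda w: sum(pair_dist(w, o) for o in cluster if o != w)
                  / (len(cluster) - 1))
    else:
        hub = cluster[0]
    print(cluster, "-> hub:", hub)
```

On this toy text the snippet yields two groups, roughly {'network', 'weights'} and {'survey', 'habits'}, each corresponding to one of the two topics in the text; real analyses would of course use larger windows, better distance measures, and proper clustering algorithms.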
The answer to this question, like the answer to most, is that it depends. I would not say that frequency counts are naive, but I suggest that their adequacy depends on your having a reasonable method of determining the importance of a token, assuming that importance varies. If you do not know the importance, do not have a reasonably valid method of determining it, or if there is no difference in importance between tokens, a simple frequency count would be preferred. If you are able to determine importance, that measure could be used to improve your results, for example by using importance as a covariate or as a factor in a multi-factor analysis.