I have been looking for a methodology for choosing which words are "key" in the dictionary - the Wordsmyth Dictionary-Thesaurus in particular. Since I'm an educator, I also want to be able to select the most important words for vocabulary study. About 2-3 years ago, I came across several articles that led me to think that a network analysis of the dictionary might yield a method for selecting keywords by their centrality in the dictionary's semantic network.
The task is to apply an algorithm that finds the words that are central because the concepts they carry link to other words as "small world" hubs. This idea comes from two sources: (a) a suggestion in Michael Stubbs' book Words and Phrases that the keywords in a language are those used in defining other words; and (b) an article by Mark Steyvers showing that dictionaries are "small world" semantic networks, with hubs that link to many other words. So, basically, I would like to find the "central" concepts in English.
My algorithmic and statistical abilities are primitive, but here is the way I have formulated the problem.
I want to find the "centrality" of a word in the Wordsmyth dictionary-thesaurus. Our thesaurus is integrated with the dictionary: it links synonyms to each appropriate definition. My proposed algorithm is as follows.
For each word, count (a rough counting sketch in code follows this list):
1. the number of definitions
2. the number of synonyms
3. words used in definitions: the number of words (and the number of definitions) that use this word, in any of its forms, in their definitions
4. the number of definitions in which this word is listed as a synonym or antonym
5. frequency (based on a corpus such as the BNC or COCA)
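To make the counting concrete, here is a minimal Python sketch. The entry structure, field names, and sample data are assumptions for illustration only, not Wordsmyth's actual schema, and factor 5 (corpus frequency) would have to be merged in from an external BNC or COCA frequency list.

```python
from collections import defaultdict
import re

# Hypothetical entry structure (illustrative only, not Wordsmyth's schema):
# headword -> list of senses, each with a definition string, synonyms, antonyms.
entries = {
    "just": [
        {"definition": "guided by fairness and justice",
         "synonyms": ["fair", "impartial"], "antonyms": ["unjust"]},
    ],
    "justice": [
        {"definition": "the quality of being fair or just",
         "synonyms": ["fairness"], "antonyms": ["injustice"]},
    ],
}

def words_in(text):
    """Crude tokenizer; matching 'any of its forms' would need lemmatization."""
    return set(re.findall(r"[a-z]+", text.lower()))

counts = defaultdict(lambda: defaultdict(int))

for headword, senses in entries.items():
    counts[headword]["definitions"] = len(senses)                            # factor 1
    counts[headword]["synonyms"] = sum(len(s["synonyms"]) for s in senses)   # factor 2
    for sense in senses:
        for w in words_in(sense["definition"]):
            counts[w]["used_in_definitions"] += 1                            # factor 3
        for w in sense["synonyms"] + sense["antonyms"]:
            counts[w]["listed_as_syn_or_ant"] += 1                           # factor 4
# Factor 5 (corpus frequency) would come from an external BNC/COCA word list.

for word in sorted(counts):
    print(word, dict(counts[word]))
```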
The problem is finding the best statistical concepts for giving weights to these factors.
What I'd like to do is calculate the "centrality" (importance) of words in our dictionary. My impression is that we will need an analysis similar to the one in the attached article, "Network Analysis of Dictionaries" by Batagelj, Mrvar, and Zaveršnik.
I have found some software that might be suitable - http://gephi.org - but I am not sufficiently technically trained to carry out the analysis myself.
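One way such an analysis might look in code: the sketch below assumes the dictionary can be exported as headword-to-definition-text pairs (the sample data is invented), builds a directed graph with an edge from each defined word to each word used in its definition, computes a Google-style PageRank centrality over it, and writes a GEXF file that Gephi can open. Edge direction, stop-word removal, and lemmatization are all choices that would need refinement.

```python
import re
import networkx as nx

# Invented sample data: headword -> list of definition strings.
entries = {
    "fair":    ["free from bias or injustice"],
    "just":    ["guided by fairness and justice"],
    "justice": ["the quality of being fair or just"],
    "law":     ["the system of rules a society enforces to secure justice"],
}

G = nx.DiGraph()
for headword, definitions in entries.items():
    for definition in definitions:
        for w in set(re.findall(r"[a-z]+", definition.lower())):
            # Edge from the defined word to each word that defines it,
            # so centrality accumulates on the defining vocabulary.
            G.add_edge(headword, w)

# PageRank-style centrality over the definition network.
rank = nx.pagerank(G)
for word, score in sorted(rank.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{word:10s} {score:.3f}")

# Export for visual exploration in Gephi, which reads GEXF natively.
nx.write_gexf(G, "wordsmyth_definitions.gexf")
```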
I hope that helps you understand what I want to accomplish.
It sounds like an interesting project. The obvious application that I see is measuring word/phrase similarity. In terms of measuring centrality, I'm skeptical that the results would be significantly different from the much simpler task of tallying frequencies of words and collocations. Is there an example word or phrase that would have a low frequency, but high centrality?
The problem with low-frequency lists - even those that eliminate proper nouns - is that they contain many low-frequency words with very little to discriminate among them. Most rare words are not at all important. Lower-frequency words may make reading difficult, but frequency doesn't identify the "important" words that carry important meanings in the language. An example would be "justice": it isn't especially frequent, but the idea of justice is important in the meanings of many words which are themselves important. The "small world" analysis would identify words that may not have a lot of links (and may not be frequent), but whose links are to words that are themselves important (and may be more frequent). The principle is the same as the one Google uses to find important, high-ranking pages: a page is more important to Google if it is linked to by pages which are themselves highly linked to, and so on.
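To illustrate that principle on a toy scale (the graph below is invented purely for demonstration), here is a case where a word with fewer incoming links outranks one with more, because its links come from well-connected words:

```python
import networkx as nx

# Invented toy graph: an edge u -> v means "u's definition uses the word v".
G = nx.DiGraph([
    # "common" is used by several definitions, but only by peripheral words.
    ("pebble", "common"), ("twig", "common"), ("shrub", "common"), ("gravel", "common"),
    # "justice" is used by fewer definitions, but by well-connected words.
    ("law", "justice"), ("fair", "justice"),
    # The words that use "justice" are themselves used by other entries.
    ("court", "law"), ("rule", "law"), ("judge", "law"),
    ("right", "fair"), ("equal", "fair"),
])

rank = nx.pagerank(G)
print("in-degree:", G.in_degree("common"), "vs", G.in_degree("justice"))   # 4 vs 2
print("pagerank: ", round(rank["common"], 3), "vs", round(rank["justice"], 3))
# "justice" has fewer direct links but a higher PageRank, because its
# links come from words that are themselves well linked.
```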
I'd be interested in seeing how the two approaches differ in the final result set. Just as an interesting note, the word "justice" ranks under 5000 out of over 1 million ngrams in the Web1T corpus.