Here is an extract from my draft ppt on "Art of searching and science of retrieval". See the image for the formula. Does it make sense? If not, forgive me.
Term Analytic:
• TF (Term Frequency): the number of occurrences of the term in the document, with stop words excluded; the more often a term occurs in a document, the more relevant that document is to a query containing the term
• DF (Document Frequency): the number of documents in the collection that contain a term
• CF (Collection Frequency): the number of occurrences of the term in the entire collection
• It is better to use a document-level statistic like DF than a collection-level statistic like CF
• Scoring on DF and CF alone tends to rank documents very highly for frequently occurring common words, and so does not help retrieve the most precise matches; hence ITF (Inverse Term Frequency) and TW (Term Weighting) are used
• Terms occurring in more than 50% of the documents in the collection should have their values reset to 0 and be treated as ‘stop terms’; documents containing more than 50% of the index terms receive similar treatment
• The weights of terms with high CF need to be scaled down
• A term's value/weight is reduced by a factor that grows with its CF; IDF (Inverse Document Frequency) is used for this
• The formula used for the purpose is IDFt = log(N / DFt), where N is the number of documents in the collection, DFt is the document frequency of term t and IDFt is the inverse document frequency of term t
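
To check whether the definitions above hang together, here is a minimal Python sketch that computes TF, DF, CF and IDF on a toy collection. The four documents, the function names (`term_stats`, `idf`, `stop_terms`) and the choice of base-10 logarithms are all assumptions made for illustration; the slides do not fix a log base, and the documents are assumed to be tokenised with stop words already removed.

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is already tokenised and
# has had its stop words removed.
docs = [
    ["search", "engine", "index", "search"],
    ["index", "retrieval", "ranking"],
    ["search", "retrieval", "query"],
    ["library", "catalogue", "search"],
]

def term_stats(docs):
    """Return per-document TF counters plus collection-wide DF and CF."""
    tf = [Counter(doc) for doc in docs]
    df = Counter()
    cf = Counter()
    for counts in tf:
        cf.update(counts)         # CF: add every occurrence of each term
        df.update(counts.keys())  # DF: add one per document containing the term
    return tf, df, cf

def idf(df, n_docs):
    """IDF_t = log(N / DF_t); base 10 is an assumption, not from the slides."""
    return {term: math.log10(n_docs / d) for term, d in df.items()}

def stop_terms(df, n_docs, threshold=0.5):
    """Terms found in more than `threshold` of the documents (the 50% rule)."""
    return {term for term, d in df.items() if d / n_docs > threshold}

tf, df, cf = term_stats(docs)
idf_values = idf(df, len(docs))

for term in sorted(df):
    print(f"{term:10s} DF={df[term]} CF={cf[term]} IDF={idf_values[term]:.3f}")
print("stop terms:", stop_terms(df, len(docs)))
```

In this toy collection "search" appears in 3 of the 4 documents, so it gets the lowest IDF and is the only term caught by the 50% rule, while a term found in a single document such as "library" gets the highest IDF. That is the behaviour the bullets describe: the more documents a term appears in, the less it should count.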