The classic approach to term filtering in information retrieval is to filter out terms at both ends. Terms that are very infrequent (rare terms) do not contribute much to concept-driven language patterns and are typically removed from the analysis through some kind of frequency filtering (e.g., remove all words that appear in fewer than 5 documents in the collection). This technique typically brings the total number of terms down to something like 5,000-10,000. Similarly, terms that are very frequent (trivial terms) are removed as stopwords. There are usually about 500 of these, which reduces the vocabulary to roughly 4,500-9,500 terms. A third technique is term stemming, where words with a common root are conflated; this may reduce the vocabulary to about 2,000-6,000 stemmed terms. More drastic reductions (down to 200-1,000 terms) require identifying the most USEFUL terms, which is a harder problem. One approach is to use TF-IDF or a similar weighting function (e.g., Log-Entropy). Another approach is to identify terms that are highly associated with the high-order principal components after a preliminary run of Latent Semantic Analysis.
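To make the pipeline concrete, here is a minimal sketch in plain Python; the toy corpus, stopword list, and thresholds are made up for illustration. It applies a document-frequency cutoff and stopword removal, then ranks the surviving terms by a summed TF-IDF score as one way to support the more drastic reduction. Stemming is omitted here; in practice you would first conflate forms like "dog"/"dogs" with, e.g., a Porter stemmer.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of lowercased tokens (assumed preprocessing).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["cats", "and", "dogs", "are", "pets"],
]

stopwords = {"the", "on", "and", "are"}   # tiny illustrative stopword list
min_df = 2                                # drop terms appearing in fewer than 2 documents

# Document frequency: number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))

# Frequency filtering plus stopword removal.
vocab = {t for t, n in df.items() if n >= min_df and t not in stopwords}

# Rank the surviving terms by a simple TF-IDF score summed over documents,
# one way to pick the most USEFUL terms for a drastic reduction.
N = len(docs)
scores = Counter()
for doc in docs:
    tf = Counter(t for t in doc if t in vocab)
    for t, f in tf.items():
        scores[t] += f * math.log(N / df[t])

top_terms = [t for t, _ in scores.most_common(200)]   # keep e.g. the top 200 terms
print(sorted(vocab), top_terms)
```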
I think it depends on the goal you are pursuing. If it is, for example, text compression, I would keep the most frequent words, since those yield the greatest compression gains. If your goal is, say, document clustering, I'd build a TF-IDF-based vocabulary, since those weights make it possible to define a meaningful "similarity" between documents.
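As a small hand-rolled sketch (toy documents, no library dependencies, function names purely illustrative) of how TF-IDF weights give you a similarity measure you could feed into a clustering algorithm:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF weight vectors (dicts) for tokenized documents."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: f * math.log(N / df[t]) for t, f in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "apple"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))   # similarity between the first and third documents
```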
You have not mentioned the purpose of the vocabulary. If you are designing some kind of training set, then it is worthwhile to keep only the valuable words; if you are building something like a hacking-related program, you have to go with the complete vocabulary, including the valuable words, the most frequent ones, and so on.
Since you tagged the question with "Information Retrieval", I assume that is your purpose. There is a trade-off between descriptive words (frequent and meaningful) and discriminative words (infrequent but meaningful).
The notion of meaningfulness is quite important: words in closed grammatical classes (prepositions, articles, adverbs) carry much less meaning and are usually discarded (stop words).
TF-IDF tries to model these two concepts: within a document, frequent words are weighted higher; however, if those words are frequent across all documents, their weight is lowered.
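For concreteness, one common formulation (there are several variants) is:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. The logarithm goes to zero as a term appears in more and more documents, which is exactly the down-weighting described above.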
Even if the information retrieval purpose is clear, you still need to think about whether you favour precision or recall, and the vocabulary choice may have an impact on that.
Finally, I would suggest performing some experiments on a training set and then deciding what works best for your use case.
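If you happen to work in Python, one quick way to run such an experiment is to sweep a few vocabulary-filtering settings and compare a simple quality score; the corpus, thresholds, and the choice of k-means with a silhouette score below are only illustrative assumptions, not a prescription:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical training corpus: a list of raw text documents.
corpus = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog chased the ball",
    "stock prices fell sharply today",
    "stock prices rallied after the announcement",
    "the market rallied on the news",
]

# Try a few vocabulary-filtering settings and compare a simple clustering score.
for min_df, max_df in [(1, 1.0), (1, 0.8), (2, 0.8)]:
    vec = TfidfVectorizer(min_df=min_df, max_df=max_df, stop_words="english")
    X = vec.fit_transform(corpus)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(min_df, max_df, len(vec.vocabulary_), silhouette_score(X, labels))
```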