What is the meaning/word of each feature in the 20 Newsgroups dataset?

Michael Niemann

It is not clear what you are after. It is easy to write a script to extract the terms for yourself (or I am sure you can find one in the NLTK kit or something similar). Do you want the root form of the terms (e.g. "sing" for "sang") or will stemming do? The first is a little hard, though you can use a part-of-speech tagger and Wordnet to help you. The second is easy using Porter stemmer code (though without part-of-speech it can be wrong sometimes, e.g. "building").

If you want to exclude the stopwords from the vocab file, why not just sort them both alphabetically and run a difference script to compare them?
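As a rough sketch of that difference step in Python (set lookup rather than sorting, which gives the same result): the sample words below are made up, and the 1-based position is kept next to each word on the assumption that the line number in the vocabulary file is what the indexed data refers to.

```python
def remove_stopwords(vocab, stopwords):
    """Return (index, word) pairs for vocab entries that are not stopwords.

    The 1-based index is kept because the position in the vocabulary file
    is assumed to be the word id used by the indexed data.
    """
    stop = {w.strip().lower() for w in stopwords}
    return [(i, w) for i, w in enumerate(vocab, start=1) if w.lower() not in stop]


# Hypothetical sample data; in practice both lists would be read from files.
vocab = ["the", "archive", "of", "sweden", "and", "hockey"]
stopwords = ["the", "of", "and"]

filtered = remove_stopwords(vocab, stopwords)
print(filtered)
```

Keeping the original index means the surviving words can still be matched to the columns of the data matrix.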

Another thought might be to use tfidf (term frequency, inverse document frequency) to score each term according to each newsgroup. That should minimise the influence of stopwords and common terms and help identify terms that are more common/important to a particular group than other newsgroups.
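A minimal sketch of that scoring, treating each newsgroup as one large "document": the counts are toy numbers, not taken from the corpus, and the formula used (relative term frequency times log(N/df)) is only one of several common tf-idf variants.

```python
import math
from collections import Counter


def tfidf_by_group(group_counts):
    """group_counts maps group name -> Counter(term -> count).

    Returns group name -> {term: tf-idf score}, where tf is the term's share
    of the group's tokens and idf is log(number of groups / groups with term).
    """
    n_groups = len(group_counts)
    df = Counter()
    for counts in group_counts.values():
        df.update(counts.keys())  # each group counts once per term

    scores = {}
    for group, counts in group_counts.items():
        total = sum(counts.values())
        scores[group] = {
            term: (count / total) * math.log(n_groups / df[term])
            for term, count in counts.items()
        }
    return scores


# Toy counts for illustration only.
toy = {
    "rec.sport.hockey": Counter({"the": 50, "puck": 8, "team": 6}),
    "sci.space": Counter({"the": 60, "orbit": 9, "team": 2}),
    "comp.graphics": Counter({"the": 40, "pixel": 7}),
}

scores = tfidf_by_group(toy)
top = max(scores["rec.sport.hockey"], key=scores["rec.sport.hockey"].get)
print(top)
```

A term that occurs in every group gets an idf of log(1) = 0, so "the" drops to the bottom without needing a stopword list.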

Miao Yu

Thanks Michael very much.

I have used the Porter stemmer code and a stopword list to filter the vocabulary. But what annoys me most is that many words such as 'pb', 'se', 'ns', 'vm' and so on are still in the vocabulary. I don't quite understand what those words mean, maybe because English is not my native language.

I don't know them from the 20 Newsgroups corpus, which I have used. They don't mean anything to me as abbreviations for anything in particular (unlike 'cf', 'etc', 'ie' which relate to Latin terms used in English). I had a quick look in the corpus for 'pb', 'se', 'ns' and 'vm' but couldn't find them, except for "per se" which is Latin for "by itself". Can you give examples? Has your tokenising divided terms like ns1 or web addresses like ericsson.se?

I downloaded the Matlab/Octave version of the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/20news-bydate-matlab.tgz) and the vocabulary for the indexed data (http://qwone.com/~jason/20Newsgroups/vocabulary.txt) from the official website.

Then I reduced the input vocabulary from 53975 words to 5000 by filtering out the low-frequency words. After that, the corpus still contains abbreviations such as 'pb', 'se', ... (see the P.S. below). So I think something more needs to be done to make this corpus meaningful.

Are there any other methods that would help? Or, if it is appropriate, could you share your relevant work or methods with me?

P.S.

The abbreviations are listed below (the top 200 words of length 2, ordered by frequency):

ve ca ll cs mr uk mb st ac pc de cc hp db ms au os al cd ed la dr tv ii ma se gm dc pa pp ny fi ps dx cb id nl xt ad nj nt pm ns em hd ch um il bu ra md si du oz mi po bc ai ti pt br ab lc bh tm uu sw di gt tx wa va aa rs hr km ei uh mm sp nz jr ee el tu op io bd ld sf ha wd ux ph sg sc mc sx kb es rc ah hz rm dg gl ne ip mu xv ql cx ap vt ss mg eh sb ga er mx ds iv ta rz gb tb fr sh ya wc mt pd ic jp gs tt sj ye bm rf cl bs da nc bb le fm wb xx ot sl lw je en su xm lq mo wg ba mp cf dl cr bi ft te qu rt sr gc za ge eu kg ls ut zx dd im vl ct cg dk vm rr nh dy oo nm kk ff sq pl mk hc bx jb pb
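For reference, the frequency cut and the two-letter listing described above can be sketched like this. The function names and the toy counts are made up for illustration; the real counts would come from summing the word counts in the indexed data.

```python
from collections import Counter


def top_k_terms(term_counts, k):
    """Keep the k most frequent terms, most frequent first."""
    return [term for term, _ in Counter(term_counts).most_common(k)]


def short_terms(terms, term_counts, length=2):
    """Terms of exactly `length` characters, ordered by descending frequency."""
    picked = [t for t in terms if len(t) == length]
    return sorted(picked, key=lambda t: -term_counts[t])


# Toy corpus-wide counts, not real figures from the dataset.
toy_counts = {
    "people": 900, "ve": 700, "ca": 650, "windows": 600,
    "se": 120, "pb": 15, "zymurgy": 1,
}

kept = top_k_terms(toy_counts, 6)          # frequency cut (5000 in the post)
two_letter = short_terms(kept, toy_counts)
print(two_letter)
```

A pure frequency cut cannot remove these fragments, because many of them are more frequent than ordinary words.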

I think that the problem is that URLs have been divided up into their individual elements. Therefore "se" actually comes from Swedish web addresses (.se). Similarly, "jp" comes from Japanese websites (.jp) and "fr" from French ones (.fr). This is the problem with not doing the tokenisation yourself: you don't control what is considered a term. While you could search for a list of all these URL terms and use it to filter your list, some of the abbreviations are actually valid, e.g. "tv" for "television", "uk" for "U.K." (the United Kingdom of Great Britain and Northern Ireland) and "mr" for "Mr." or Mister. Personally, I think you would be best off finding software that lets you do the tokenisation yourself, and modifying it to match what you consider to be a word. Have a look at sites like http://www.nltk.org/ (Python scripts).

I am afraid that while my work on the 20 newsgroups did include tokenisation, my output was specialised for my research. I can tell you though that my tokeniser was based on the tokenizer.sed that was used for the Penn Treebank - http://www.cis.upenn.edu/~treebank/tokenizer.sed .
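To illustrate what controlling the tokenisation could look like, here is a small regular-expression sketch that keeps host names and contractions in one piece. It is not the tokeniser used to build the published vocabulary, and the pattern and sample sentence are assumptions for illustration only.

```python
import re

# Alternatives are tried left to right, so host names win over plain words.
TOKEN_RE = re.compile(
    r"""
    [a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}    # host names such as ericsson.se
    |
    [a-z][a-z0-9]*(?:'[a-z]+)?                # words, optionally with a contraction
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Lower-case the text and return tokens, keeping host names whole."""
    return TOKEN_RE.findall(text.lower())


sample = "Mail ns1.ericsson.se; I've seen it per se."
tokens = tokenize(sample)
print(tokens)
```

With a rule like this, "se" only appears as a token when it is a word on its own (as in "per se"), not every time a .se address occurs. The same idea would keep fragments like "ve" and "ll" attached to their words, if those come from contractions such as "I've" and "we'll", which is a guess.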

Dear Michael, thanks for your explanation.

I have checked NLTK and the Penn Treebank script you pointed me to, and done some research on tokenisation online.

I think it is not necessary to do the tokenisation myself. I just need to shrink the vocabulary (http://qwone.com/~jason/20Newsgroups/vocabulary.txt) by matching it against an a priori dictionary, which will filter out those abbreviations automatically. Is that appropriate?

But I think the best way to handle this problem is to replace those abbreviations with their real meanings, for example replacing 'se' with 'sweden', 'tv' with 'television', etc., rather than simply deleting them without any consideration.
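A minimal sketch of that dictionary match, assuming the dictionary is just a collection of known words (the tiny word set below is invented; a real one would be loaded from a word list of your choice):

```python
def filter_by_dictionary(vocab, dictionary):
    """Split vocab into words found in the dictionary and words that are not."""
    known = {w.lower() for w in dictionary}
    kept = [w for w in vocab if w.lower() in known]
    dropped = [w for w in vocab if w.lower() not in known]
    return kept, dropped


# Hypothetical data for illustration.
vocab = ["windows", "se", "tv", "hockey", "pb", "us"]
dictionary = {"windows", "hockey", "tv", "us"}

kept, dropped = filter_by_dictionary(vocab, dictionary)
print(kept)
print(dropped)
```

What survives depends entirely on the dictionary: one that lists "tv" keeps it, one that does not silently drops it, so it is worth inspecting the dropped list before discarding it.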

That certainly is one way. Watch out for terms like "us". Unless you know its original usage, you won't know whether it was "U.S." or "US" or "us" (i.e. first person plural pronoun which will be a stopword).
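As a sketch of the replacement idea with that caveat built in: the mapping below is a hand-made example using only the abbreviations mentioned in this thread, and the ambiguous set is an assumption you would have to curate yourself.

```python
# Hand-made, illustrative mapping; not an exhaustive or authoritative list.
EXPANSIONS = {"se": "sweden", "tv": "television", "jp": "japan", "fr": "france"}

# Terms whose meaning cannot be decided without seeing the original usage.
AMBIGUOUS = {"us"}


def expand_terms(vocab, expansions, ambiguous):
    """Replace known abbreviations; leave ambiguous terms as-is and report them."""
    expanded, unresolved = [], []
    for term in vocab:
        if term in ambiguous:
            unresolved.append(term)
            expanded.append(term)
        else:
            expanded.append(expansions.get(term, term))
    return expanded, unresolved


vocab = ["se", "tv", "us", "hockey"]
expanded, unresolved = expand_terms(vocab, EXPANSIONS, AMBIGUOUS)
print(expanded)
print(unresolved)
```

If an expansion such as "sweden" is already in the vocabulary, the two columns would need to be merged (their counts summed) rather than kept as duplicates.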

OK, thanks for your reminder. It's very nice of you.